About the project

This is a playground for experimenting with something called R Markdown.

At first glance, R Markdown appears to be a fairly simple markup language whose raison d’être is to enable the embedding of “live” R code into web pages and other hypertext-compatible documents.

Some useful links are listed below:

Repository page

R Markdown basics (same link as on MOOC)

Embedding R code into a document (same link as on MOOC)

R Markdown cheatsheet (a different one!)

Regular expression cheatsheet (we linguists use regexes a lot!)

The theme used for this page is called readable.


Linear Regression in R

1. Reading the Data into R

You can use read_csv() to read a comma-separated table (.csv) into R, but it does require prior installation of the readr library. If you have RStudio, I warmly recommend using the interactive ‘Import Dataset’ menu instead.

As an alternative, read.table() is more generic than read.csv() and comes with base R. You have to specify a lot more details with it, such as the cell separator character (sep=",").
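To make the difference concrete, here is a minimal sketch of the base-R route: it writes a tiny hypothetical comma-separated file and reads it back with read.table(), spelling out the details (separator, header row) that read_csv() would infer for us. The file and its columns are made up for illustration only.

```r
# Hypothetical tiny .csv file, for illustration only:
tmp <- tempfile(fileext = ".csv")
writeLines(c("gender,age,points", "F,21,23", "M,25,19"), tmp)

# Base R needs the separator and header spelled out explicitly:
students <- read.table(tmp, sep = ",", header = TRUE)
str(students)  # a 2-row data frame with columns gender, age, points
```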

library(readr)
learning2014 <- read_csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt")

2. Surveying and Visualizing the Data

Here are some ways to get an overview of the data, which is always a good first step:

Show the dimensions of the data frame:

dim(learning2014)
## [1] 166   7

Print a summary of its variables (columns):

summary(learning2014)
##     gender               age           attitude          deep      
##  Length:166         Min.   :17.00   Min.   :1.400   Min.   :1.583  
##  Class :character   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333  
##  Mode  :character   Median :22.00   Median :3.200   Median :3.667  
##                     Mean   :25.51   Mean   :3.143   Mean   :3.680  
##                     3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083  
##                     Max.   :55.00   Max.   :5.000   Max.   :4.917  
##       stra            surf           points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00

The following command opens a new window (or tab) in which you can interactively browse and explore the dataset (though if you’re using a pop-up blocker you’ll likely see nothing):

View(learning2014)

This dataset was apparently collected to investigate links between academic performance (variable “points”) and various personality and demographic factors. As is customary, each row represents one person from whom these variables were measured.

2.1. Analyzing the Data with Pretty Scatter Plots

The following libraries are needed for the prettiest charts and graphs:

library(ggplot2)

library(GGally)

Let’s print out a hugely informative array of plots:

ggpairs(learning2014, mapping = aes(col=gender,alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))


This ggpairs() is a more advanced and sophisticated alternative to base R’s pairs(). What both do is draw a scatter plot correlating each pair of variables found in the dataframe (or just the pairs that the user specifies).

No highly obvious correlations emerge from the scatter plots. One predictable association can be discerned – namely, the one between exam scores and attitude towards the subject. The plot also seems to suggest a very slight negative correlation between age and points. Thirdly, a glance at the correlation coefficients reveals that predictor surf has the second-highest absolute value (distance from zero) after attitude. We’ll include these three explanatory variables in our linear regression analysis.

3. Correlating points with attitude, age, and surf using Linear Regression

Linear regression models the relationship between a continuous outcome variable (AKA regressand or y variable) and one or more input variables – predictors or x-variables – which can likewise be continuous but do not have to be. In linear regression we test the hypothesis that the outcome variable is a function of, firstly, each predictor multiplied by its regression coefficient and, secondly, an error term representing the distance by which the model fails to match the exact behavior of the outcome variable.
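That decomposition can be checked by hand. A minimal sketch (using R’s built-in mtcars data as a stand-in, since the course data lives behind a URL): the fitted outcome is just intercept + coefficient × predictor for each predictor, and adding the residual back recovers the observed outcome exactly.

```r
# Sketch: mtcars stands in for learning2014 here.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# outcome = intercept + b1*x1 + b2*x2 + residual
manual <- coef(fit)[1] + coef(fit)[2] * mtcars$wt + coef(fit)[3] * mtcars$hp
stopifnot(isTRUE(all.equal(unname(manual + resid(fit)), mtcars$mpg)))
```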

summary(lm(points ~ attitude + age + surf, data=learning2014))
## 
## Call:
## lm(formula = points ~ attitude + age + surf, data = learning2014)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.973  -3.430   0.167   3.997  10.444 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 16.85628    3.53786   4.765 4.18e-06 ***
## attitude     3.42388    0.57353   5.970 1.45e-08 ***
## age         -0.08713    0.05361  -1.625    0.106    
## surf        -0.96049    0.79943  -1.201    0.231    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.294 on 162 degrees of freedom
## Multiple R-squared:  0.2082, Adjusted R-squared:  0.1935 
## F-statistic:  14.2 on 3 and 162 DF,  p-value: 2.921e-08

Perhaps the most important figures in this printout are:

  1. The Estimate column, which estimates how much (and in what direction) the regressand’s value changes when a given predictor’s value changes by 1 unit.
  2. The Pr(>|t|) column, which measures the statistical significance of each predictor, and
  3. The Multiple and Adjusted R-squared values, which reflect the overall prediction power of the model.

As common sense suggests, attitude shows the strongest correlation with points. The effect is statistically significant at the highest level. Age and surface learning also exert an effect, but their respective effects do not reach the required .05 level of statistical significance. It is very important to note, however, that age comes much closer to statistical significance than surf even though the absolute value of its regression coefficient is much smaller. What’s going on?

##  <think break>

These figures reflect the fact that surf and age are measured on different scales. Surface learning ranges only from 1 to 5, while age (in our dataset) ranges from 17 to 55. Although a change by one single unit in surf is more consequential than a change of one year in age, the latter variable has much more explanatory power when its entire range of variation is taken into consideration.

Remember that my eye caught the correlative pattern in the points-by-age scatter plot but not in the points-by-surf one? This is a good example of how graphical presentations of data are often more informative than cold numbers, demonstrating the wisdom of Kimmo’s “always draw plots” principle.

However, both age and surf must now be dropped from the model due to their statistically non-significant effects.

4. Removing the Non-Significant Predictors and Interpreting the Model

summary(lm(points ~ attitude, data=learning2014))
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

The Residuals table quantifies the error term. For example, the model’s worst overestimation is by almost 17 points (this student massively underachieved compared to the model prediction), while its worst underestimation is by about 10 points (this student exceeded the model’s expectations by 10 exam points).

Intercept reports the point at which the model intersects the Y axis. We might also think of it as the value of the outcome variable before the effects of any explanatory variable have been modeled.

The Std. Error column measures the uncertainty of each coefficient estimate: roughly, the amount by which the Estimate would be expected to vary if the data were re-collected and the model refit. An Estimate that is large relative to its standard error is unlikely to be zero.

The column entitled t value reports the t test statistic, i.e. the Estimate divided by its Std. Error. Bigger absolute values are better (more significant), but the t-test always requires the degrees of freedom to be factored into the equation. It is thus not an entirely straightforward indicator of predictor significance – that is what the Pr(>|t|) value is for. If the Pr(>|t|) value has asterisks next to it, the predictor is considered statistically significant at the indicated level (a lone dot marks only the weaker .1 level).
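To make that relationship concrete, here is a sketch (with R’s built-in mtcars as hypothetical stand-in data) showing that the t value is just Estimate / Std. Error, and Pr(>|t|) is its two-tailed tail probability at the residual degrees of freedom:

```r
fit <- lm(mpg ~ wt, data = mtcars)
est <- summary(fit)$coefficients["wt", ]   # Estimate, Std. Error, t value, Pr(>|t|)

t_manual <- unname(est["Estimate"] / est["Std. Error"])
p_manual <- 2 * pt(-abs(t_manual), df.residual(fit))  # two-tailed p

stopifnot(isTRUE(all.equal(t_manual, unname(est["t value"]))))
stopifnot(isTRUE(all.equal(p_manual, unname(est["Pr(>|t|)"]))))
```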

The Residual standard error is essentially the standard deviation of the residuals (computed with the residual degrees of freedom in the denominator) – an estimate of the typical size of the model’s prediction error, in exam points.

Multiple R-squared indicates the proportion of the overall variation in points that can be predicted by all the predictors included in the model together. Here, attitude is the only one. Adjusted R-squared is an estimate of R-squared that penalizes for the number of predictors – therefore its value here (with only one predictor) is nearly identical to the unadjusted R-squared.
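For the curious, R-squared can be computed by hand from the residuals. A minimal sketch, again using R’s built-in mtcars as a stand-in dataset:

```r
# R-squared = 1 - RSS/TSS: the share of outcome variance the model explains.
fit <- lm(mpg ~ wt, data = mtcars)
rss <- sum(resid(fit)^2)                        # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total sum of squares
r2  <- 1 - rss / tss
stopifnot(isTRUE(all.equal(r2, summary(fit)$r.squared)))
```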

The F-statistic tests the highly skeptical null hypothesis that the model has zero predictive power. This hypothesis is usually, as here, rejected, as indicated by the extremely small p-value.

5. Diagnostic Plots

plot(lm(points ~ attitude, data=learning2014),which=1)

Residuals vs Fitted plotting is used to analyze whether there is any systematic pattern to the model error. Such a pattern would be bad news because it would cast doubt on one of the crucial premises of linear regression, namely, that the variance of the residuals is random. Now we can see slight heteroscedasticity in the model, i.e. the model appears to predict high exam scores slightly more reliably than low ones. This pattern is weak and only barely discernible though.

plot(lm(points ~ attitude, data=learning2014),which=2)

Q-Q plots are used to analyze the validity of another model assumption – that the residuals of the model are normally distributed. In an ideal case, the observations (dots) would align neatly along a straight line. Now we can see that this is not entirely the case – large negative residuals are more frequent than expected, while large positive residuals are less frequent than expected.

plot(lm(points ~ attitude, data=learning2014),which=5)

This plot illustrates Leverage, i.e. how influential each observation (data point) is in determining the x-variable’s regression coefficient. The idea is to ascertain that there are no outliers skewing the model, i.e. exerting a massive impact on the angle of the regression line. Any highly unusual observation at either extreme of the x-variable’s distribution is potentially an influential outlier. In our case, we can see that no hugely influential observations exist – the scale of the leverage goes only from 0 to about .05, so there are practically no outliers with disproportionate leverage.


Logistic Regression in R

Welcome to the logistic regression chapter! Here we will practise predicting categorical outcomes, which are trickier to model than continuous ones because they require us to calculate probabilities. Probabilities do not range between negative and positive infinity the way continuous response variables tend to, so in order to make the regression coefficients analogous to those in linear regression, we need a link function in between. We’ll start by importing our training dataset into R:

alcohol<-read.table(file = "drunkards.txt", sep = "\t", header = TRUE, row.names = NULL)

This data was collected to analyze correlations between alcohol consumption and various social and demographic factors.
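As an aside, the link function mentioned above is, for binomial models, the logit, log(p / (1 − p)), which maps probabilities from the (0, 1) interval onto the whole real line. Base R’s qlogis() and plogis() implement it and its inverse – a quick sketch:

```r
# The logit link maps probabilities (0..1) onto the whole real line,
# which is what lets logistic regression coefficients behave like
# linear regression coefficients.
p <- 0.8
stopifnot(isTRUE(all.equal(qlogis(p), log(p / (1 - p)))))  # logit
stopifnot(isTRUE(all.equal(plogis(qlogis(p)), p)))         # inverse logit
```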

1. My 4 Variables

It was already seen in the Datacamp exercise that sex correlates heavily with drinking habits. This is unsurprising. The reasons behind the phenomenon are rather obvious and need not be discussed here. Therefore, I will leave this variable out of my own analysis and instead choose the following 4:

  1. Age: a somewhat boring choice, but I wanted some predictors to be continuous; I predict that in the age range represented in the data, higher age correlates with more drunkenness
  2. Reason: I predict that having chosen one’s school due to its geographical closeness to home is an indicator of low study motivation and correlates with increased alcohol consumption
  3. Studytime: I predict that weekly study time is inversely correlated with alcohol consumption
  4. Romantic: I predict that being single correlates with increased alcohol consumption

Let’s see what we can find out.

Predictor: age

table(alcohol$age)
## 
##  15  16  17  18  19  20  22 
##  81 107 100  81  11   1   1

I chose age as one of the variables because I wanted some of my variables to be continuous. There are other continuous variables in the data – such as grades and rates of failure and absence. However, I felt that these variables present serious chicken-or-egg problems for the interpretation of results: even if bad grades are found to correlate with high alcohol use, can we really claim that the bad grades came first, i.e., caused the drinking and not the other way around? I find such a premise questionable. Age has no such problems: it is always the cause, never the effect (everyone grows older at the same rate regardless of their drinking habits).
There are only 7 age groups in the dataset, two of which are so under-represented as to be basically negligible. There are very few students over the age of 18 in the dataset. Findings based on 11 observations or fewer are rarely statistically significant. We should be able to get something out of the data on the other age groups, though.

We will start drawing pictures now. Let us therefore load ggplot2 into memory, as it seems to be more flexible and customizable than base R’s graphics library.

library(ggplot2)

First, let’s look at a bar chart of high alcohol use as function of age

ggplot(data=alcohol, aes(age, fill=high_use)) + geom_bar(position = "dodge") + ylab("count") + ggtitle("High alcohol consumption as a function of age", subtitle="N = 382") + xlab("age") + scale_x_continuous(breaks=15:22)

This is not a bad way to visualize the frequency of a phenomenon across groups. We can see that the proportion of heavy drinking seems highest in the 17-year-olds. However, since the age groups vary in size, I find comparison easier using a percent-stacked bar chart instead. The tradeoff is that we lose sight of the absolute group sizes:

ggplot(data=alcohol, aes(age, fill=high_use)) + geom_bar(position = "fill") + ylab("relative frequency") + ggtitle("High alcohol consumption as a function of age", subtitle="N = 382") + scale_x_continuous(breaks = min(alcohol$age):max(alcohol$age)) + xlab("age")

This shows us that my initial visual impression was inaccurate – the 19-year-olds are proportionately (slightly) heavier drinkers than the 17-year-olds. This data could hardly be said to support the hypothesis that heavy alcohol consumption becomes more frequent with age, however. Rather, what seems to happen is that it increases for a couple of years after the first taste, then stabilizes in the late teens.

We could now assess the statistical significance of these findings using the chi-squared test, but I’ll move on to the next variable instead in order to leave some analysis for logistic regression to do.

Predictor: reason

summary(alcohol$reason)
##     course       home      other reputation 
##        140        110         34         98

This variable has a more convenient frequency distribution of levels (different values that it can have) than the previous one: no single one of its groups is grossly underrepresented the way some of the age groups were. Now that we know that each group is sufficiently represented, let’s visually assess their respective rates of high alcohol use:

ggplot(data=alcohol, aes(reason, fill=high_use)) + geom_bar(position = "fill") + ylab("relative frequency") + ggtitle("Alcohol consumption and reason of school choice", "N = 382") + xlab("Reason for having chosen one's current school")

This variable appears to exert a significant effect! Contrary to my hypothesis, however, it is not the “home” group that exhibits the highest rate of heavy alcohol consumption – it’s the “other” group, i.e., the wastebasket category reserved for the miscellanea of observations not falling into any of the three “main” categories of this variable. It is unfortunate that the researchers have chosen not to record the information on these “other” reasons for choosing a school – important information was lost with that decision. Other than that, the “home” group does indeed display an increased incidence of high use relative to the other explicitly defined groups.

Again, we’ll postpone further analysis to later stages.

Predictor: studytime

table(alcohol$studytime)
## 
##   1 1.5   2 2.5   3   4 
## 100   4 187   4  60  27

Uh oh. Something goofy is afoot. What’s up with the levels “1.5” and “2.5”, and why are their frequencies so darn low? If we take a look at the metadata, we see that studytime was designed to be an ordinal variable with levels from 1 through 4. Then where have the intermediate levels 1.5 and 2.5 come from?

## <think break>

They come from the Data Wrangling phase of this exercise, where the instructions were that whenever there were two observations of a given numeric variable (one from each of the two original datasets), we had to use the mean of these two values in the new combined dataset. In my estimation, this is not a viable approach to studytime because it is not a continuous variable. Its original levels were less than 2 hours, 2 to 5 hours, 5 to 10 hours and over 10 hours of weekly study. Since the levels are not evenly spaced, can we really calculate an average?
IMHO a better approach would have been to treat studytime as essentially categorical, using only the value from the first of the source datasets (student_mat) whenever the two datasets differed. This would be methodologically consistent with the treatment of the other categorical (non-numeric) variables in the dataset.

Unfortunately, there is no easy way to identify which rows of the source dataset student_mat correspond to these goofy values so that they could be used as replacements. This is because the rows in the two source datasets lack a unique identifier field (which IMHO is a methodological error). Even the combination of 13 columns that we used as student identifiers when merging the data frames is not reliable – there are rows with identical sequences of values in these columns. Therefore, to replace the goofy studytime values with the original integer ones from student_mat, I made a temporary adjustment to the code I had used to combine the two datasets in the previous section, simply adding one condition to the for-loop to give studytime the same treatment as the non-numeric variables:

if (startsWith(colnames(two.columns)[1],"studytime")) {AlcoholConsumption[variable] <- two.columns[,1]}

The resulting column uses the original integer values from student_mat. We can now add it to our dataset as a new column. The old studytime with the goofy decimal values will remain intact, however – we’re just bringing in a better alternative to it. This is a good rule of thumb when modifying existing variables – always keep the original data too.

alcohol$studytime.adjusted <- AlcoholConsumption$studytime

Let’s take a look at the contents of our new column:

table(alcohol$studytime.adjusted)
## 
##   1   2   3   4 
## 103 190  62  27

Much better. Now there are no ridiculously underrepresented groups. However, we should make another little adjustment in order to prevent R from treating this variable as continuous. This is easily accomplished by converting our new studytime.adjusted variable into a factor. A factor is a vector whose values are treated as categorical and can be named and ordered however we choose. In our case, we want to preserve the original ordering. However, we do want to change the names of the levels – the labels 1, 2, 3 and 4 are a bit misleading as they might lead us to assume that the levels are evenly spaced. We will replace these numeric labels with new ones that accurately reflect what the levels represent. These adjustments are easy to do in R:

alcohol$studytime.factor <- as.factor(alcohol$studytime.adjusted) #Create a version of studytime.adjusted that is categorical
levels(alcohol$studytime.factor) <- c("<2h","2-5h","5-10h",">10h") #Name its levels appropriately

Now the frequency table looks like this:

table(alcohol$studytime.factor)
## 
##   <2h  2-5h 5-10h  >10h 
##   103   190    62    27

With these conversions made, let’s take a look at how the data compares to our hypothesis that weekly study time is inversely correlated with risk of high alcohol use. Here’s the percent-stacked bar chart.

ggplot(data=alcohol, aes(studytime.factor, fill=high_use)) + geom_bar(position = "fill") + ylab("relative frequency") + ggtitle("High alcohol consumption as a function of weekly study time", "N = 382") + xlab("weekly study time")

This data supports my fairly self-evident hypothesis that diligent students are less often heavy drinkers than lazy ones.

Predictor: romantic

Alcohol lowers inhibitions and boosts libido. This is why it has been a powerful aid in human pair-bonding for aeons. In fact, this might just be the most important reason it is used in the first place, whether people admit it or not. Therefore, the assumption that people who are in a relationship, i.e. have already pair-bonded, have less reason to use alcohol was a fairly obvious one. As a binary variable, this one should be the easiest of the four to analyze.

Contingency tables (aka cross-tabulations) are a handy way of assessing correlations between variables that can be represented in 3x3 or smaller tables. We use them a lot in linguistics, e.g. when assessing whether the frequency distribution of a categorical variable such as that-omission vs. that-retention has remained constant or changed between two time periods. High alcohol consumption as a function of relationship status is a virtually identical phenomenon from a methodological standpoint. We’ll start by printing a contingency table of raw frequencies:

addmargins(table(alcohol$high_use, alcohol$romantic, dnn = c("high_use", "has_bf/gf")))
##         has_bf/gf
## high_use  no yes Sum
##    FALSE 180  88 268
##    TRUE   81  33 114
##    Sum   261 121 382

Hmm…not nearly as much of a difference as I expected. For a clearer picture, a table of relative frequencies is useful:

round(prop.table(table(alcohol$high_use, alcohol$romantic, dnn=c("high_use","has_bf/gf")), 2), digits=2)
##         has_bf/gf
## high_use   no  yes
##    FALSE 0.69 0.73
##    TRUE  0.31 0.27

Students with a romantic partner do seem less likely to be heavy drinkers, but the difference is only 4 percentage points in this dataset. Now I’m particularly tempted to test the significance of the effect with the chi-square test, but I’ll refrain. Logistic regression should shed more light on all that.
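For the record, here is what that chi-square test would look like – a sketch built from the counts printed in the contingency table above rather than from the raw data:

```r
# 2x2 table of the counts printed above:
# 180/88 non-heavy drinkers, 81/33 heavy drinkers by relationship status.
tab <- matrix(c(180, 81, 88, 33), nrow = 2,
              dimnames = list(high_use = c("FALSE", "TRUE"),
                              romantic = c("no", "yes")))
chisq.test(tab)$p.value  # well above .05, matching the visual impression
```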

2. Binary Logistic Regression with 4 Independent Variables

model<-glm(high_use ~ age + reason + studytime.factor + romantic, data = alcohol, family = "binomial") #Creates the model
logodds<-coef(model) #Extracts and stores just the log of odds -values produced by the model
ORs<-cbind(data.frame(Estimate = round(exp(logodds),digits=3)),round(exp(confint(model)),digits=3)) #Extracts and stores the odds ratios and their confidence intervals, rounded to 3 decimals, in an eye-friendly table
## Waiting for profiling to be done...
summary(model) #Prints out the results of the model
## 
## Call:
## glm(formula = high_use ~ age + reason + studytime.factor + romantic, 
##     family = "binomial", data = alcohol)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5858  -0.8676  -0.6575   1.2070   2.3469  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -4.2950     1.6942  -2.535 0.011239 *  
## age                     0.2322     0.1015   2.286 0.022235 *  
## reasonhome              0.2789     0.2844   0.981 0.326773    
## reasonother             0.8068     0.4083   1.976 0.048138 *  
## reasonreputation       -0.2054     0.3236  -0.635 0.525604    
## studytime.factor2-5h   -0.4035     0.2652  -1.521 0.128178    
## studytime.factor5-10h  -1.4077     0.4216  -3.339 0.000842 ***
## studytime.factor>10h   -1.1548     0.5962  -1.937 0.052769 .  
## romanticyes            -0.2624     0.2598  -1.010 0.312486    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 435.69  on 373  degrees of freedom
## AIC: 453.69
## 
## Number of Fisher Scoring iterations: 4

As we anticipated at the beginning, higher age is shown to favor the probability of heavy alcohol use. The effect is significant at the .05 level. However, since only 4 different ages (15 through 18) are adequately represented, this is not much of a finding. It is to be expected that adults use more alcohol than kids. More interesting would be to look at the progression of people’s drinking habits from early adulthood to middle age and beyond, but that’s another topic.

My prediction on the effect of reason proves somewhat too bold. I assumed that people who choose a school near their homes are uninspired students and likely heavier-than-average drinkers. In our dataset, having chosen your school based on its convenient location increases the risk of being a heavy drinker by 32%. Here are the odds ratios and their confidence intervals:

ORs
##                       Estimate 2.5 % 97.5 %
## (Intercept)              0.014 0.000  0.364
## age                      1.261 1.036  1.544
## reasonhome               1.322 0.756  2.312
## reasonother              2.241 1.002  5.008
## reasonreputation         0.814 0.427  1.526
## studytime.factor2-5h     0.668 0.397  1.125
## studytime.factor5-10h    0.245 0.102  0.539
## studytime.factor>10h     0.315 0.085  0.931
## romanticyes              0.769 0.458  1.272

Though a 32% increase seems non-negligible, the effect falls way short of statistical significance. It is not a matter of a tiny sample either, since there were 110 people in this group. I guess I’ll just say that some correlation was found but it wasn’t strong enough to make confident statements.

Something intriguing was discovered with this variable, however. The students whose reason for picking the school they did was classified as “other” appear to be heavier drinkers than those with other reasons, and the effect is statistically significant at the .05 level. It is most unfortunate that the researchers did not document the “other” reasons. We are now left to speculate on what those other reasons might be and why they correlate with high alcohol consumption. Here’s a hypothesis: most of the “other” reasons for choosing a school are that a particular girl/guy is attending it. And if one is willing to attend an inconveniently located school over a girl/guy, then it stands to reason that one is also willing to resort to other somewhat drastic measures – such as liquid courage – to get him/her.

I was mostly right about studytime’s correlation with drinking habits. Compared to the reference group (<2h of study a week), each of the more studious groups proves less likely to be heavy drinkers. The strange thing is that it is not the people who study the most who drink the least – the strongest disfavoring effect is found in those who study from 5 to 10 hours a week. Their risk of high alcohol use is a whopping 76% smaller (an odds ratio of .24) than that of the laziest students, and the effect is significant at the highest level (p < .001). The most studious group also shows a strong disfavoring effect at -69%. Technically speaking, the effect falls slightly short of the critical value of statistical significance (its p-value is .052), but that can be attributed to the small sample size (only 27 students) of this group. Also note that the confidence interval of the group’s odds ratio does not include 1! This is perhaps more intuitively understood from the log-odds table, which shows that the entire confidence interval of studytime.factor>10h is in the negatives:

round((confint(model)),digits=3)
## Waiting for profiling to be done...
##                        2.5 % 97.5 %
## (Intercept)           -7.668 -1.011
## age                    0.035  0.434
## reasonhome            -0.279  0.838
## reasonother            0.002  1.611
## reasonreputation      -0.851  0.423
## studytime.factor2-5h  -0.924  0.118
## studytime.factor5-10h -2.285 -0.617
## studytime.factor>10h  -2.463 -0.072
## romanticyes           -0.781  0.240
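The odds ratios reported earlier are simply these log-odds coefficients exponentiated. A quick check, using the studytime.factor5-10h coefficient copied from the model printout above:

```r
# exp() turns a log-odds coefficient into an odds ratio;
# -1.4077006 is the studytime.factor5-10h coefficient printed above.
odds_ratio <- exp(-1.4077006)
round(odds_ratio, 3)  # matches the 0.245 in the ORs table
```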

It is interesting, however, that it’s not the most studious group that drinks the least. We might hypothesize that those who study 5-10h a week live happy, balanced lives and do not need to drink themselves silly to relax and have fun, whereas those who study massive amounts need more extreme measures, such as intoxicants, in order to unwind.

Lastly, having a romantic partner proves statistically non-significant in predicting high alcohol use. The risk does decrease in a relationship, but only by 23%. P-value is .31; sample size is in the triple digits, i.e. large enough. I guess I was simply mistaken on this one.

3. Using Our Model for Predictions

We will now test the predictive power of our model. We’ll start by making a convenient “training copy” of our data, consisting of the 4 explanatory variables only. This way the real, observed outcomes are temporarily unknown and hidden, making the whole exercise more interesting!

alcohol.practice <- alcohol[c("age","reason","studytime.factor","romantic")]

Here’s a glimpse of the new data frame:

head(alcohol.practice)
##   age     reason studytime.factor romantic
## 1  15       home             >10h       no
## 2  15 reputation             2-5h      yes
## 3  15 reputation              <2h       no
## 4  15     course            5-10h       no
## 5  15 reputation            5-10h      yes
## 6  15     course            5-10h       no

The probability of the target outcome for a given data point (student) is calculated from the sum of the coefficient of Intercept and the coefficients of the predictor values corresponding to the data point. This sum – the log-odds – is exponentiated into odds, which in turn are converted into a probability. We’ll illustrate this by arbitrarily selecting a student from our dataset. For example, let’s pick Student #100 from the dataset and make a prediction on whether they’re a heavy drinker:

alcohol.practice[100,]
##     age reason studytime.factor romantic
## 100  17 course            5-10h       no

Let’s reprint the coefficients so we can work more easily:

as.data.frame(coef(model))
##                       coef(model)
## (Intercept)            -4.2949807
## age                     0.2321509
## reasonhome              0.2789380
## reasonother             0.8067722
## reasonreputation       -0.2054305
## studytime.factor2-5h   -0.4035197
## studytime.factor5-10h  -1.4077006
## studytime.factor>10h   -1.1547965
## romanticyes            -0.2623731

So the natural logarithm of the odds of this person using a lot of alcohol is:

(log_of_odds <- -4.2949807 + 17*0.2321509 + 0 + -1.4077006 + 0) #Get the logarithm of odds
## [1] -1.756116

And the odds are:

(odds <- exp(log_of_odds)) #Get the odds
## [1] 0.1727144

And the probability is:

(probability <- odds/(1+odds)) #Get the probability
## [1] 0.1472775
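Incidentally, base R’s plogis() is the inverse-logit function, so it collapses the last two steps (log-odds to odds to probability) into a single call:

```r
# plogis() is the inverse logit: exp(x) / (1 + exp(x)),
# so it converts log-odds into a probability directly:
plogis(-1.756116)
## [1] 0.1472775
```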

That’s quite a bit of arithmetic to do, so fortunately there’s a predict() function that does all of it for us. We need only specify the model, the data, and the type of our outcome variable:

predict(model, alcohol.practice[100,], type = "response")
##       100 
## 0.1472774

The slight discrepancy between the output of predict() and the manually calculated value arises because the former uses the exact coefficient values, while my manual computation uses the rounded values from the model printout.
Let’s now add a column to our practice data frame with the estimated probabilities of high alcohol use for every student, and look at a sample of the output:

alcohol.practice$probability <- predict(model, alcohol.practice, type="response")
alcohol.practice[seq(1,nrow(alcohol.practice),25),]
##     age     reason studytime.factor romantic probability
## 1    15       home             >10h       no   0.1559632
## 26   15     course              <2h       no   0.3073117
## 51   16       home              <2h      yes   0.3626220
## 76   16       home              <2h       no   0.4251593
## 101  17 reputation             >10h       no   0.1533398
## 126  17 reputation            5-10h       no   0.1232997
## 151  18     course             2-5h       no   0.3729025
## 176  15 reputation             2-5h      yes   0.1565611
## 201  15       home             2-5h       no   0.2814462
## 226  16       home              <2h       no   0.4251593
## 251  16     course              <2h       no   0.3588022
## 276  17     course             2-5h       no   0.3203996
## 301  17     course              <2h       no   0.4137666
## 326  18       home              <2h       no   0.5405788
## 351  18       home            5-10h       no   0.2235620
## 376  18       home             2-5h       no   0.4400777

And as you already guessed, the binary prediction for each student is made as follows: if the estimated probability of high alcohol use is greater than 50%, we predict that s/he is a heavy drinker; otherwise we predict that s/he isn’t. We will now add a column with the final prediction for every student and look at the output:

alcohol.practice$prediction <- alcohol.practice$probability > 0.5
alcohol.practice[seq(1,nrow(alcohol.practice),25),]
##     age     reason studytime.factor romantic probability prediction
## 1    15       home             >10h       no   0.1559632      FALSE
## 26   15     course              <2h       no   0.3073117      FALSE
## 51   16       home              <2h      yes   0.3626220      FALSE
## 76   16       home              <2h       no   0.4251593      FALSE
## 101  17 reputation             >10h       no   0.1533398      FALSE
## 126  17 reputation            5-10h       no   0.1232997      FALSE
## 151  18     course             2-5h       no   0.3729025      FALSE
## 176  15 reputation             2-5h      yes   0.1565611      FALSE
## 201  15       home             2-5h       no   0.2814462      FALSE
## 226  16       home              <2h       no   0.4251593      FALSE
## 251  16     course              <2h       no   0.3588022      FALSE
## 276  17     course             2-5h       no   0.3203996      FALSE
## 301  17     course              <2h       no   0.4137666      FALSE
## 326  18       home              <2h       no   0.5405788       TRUE
## 351  18       home            5-10h       no   0.2235620      FALSE
## 376  18       home             2-5h       no   0.4400777      FALSE

There! Now we have an estimated probability of high alcohol use for each student, plus a binary prediction based on that probability. We can finally unveil the real, actual outcomes observed in the dataset, and juxtapose them with the predicted outcomes for comparison:

alcohol.practice$real_outcome <- alcohol$high_use
alcohol.practice[seq(1,nrow(alcohol.practice),25),5:7]
##     probability prediction real_outcome
## 1     0.1559632      FALSE        FALSE
## 26    0.3073117      FALSE         TRUE
## 51    0.3626220      FALSE        FALSE
## 76    0.4251593      FALSE        FALSE
## 101   0.1533398      FALSE        FALSE
## 126   0.1232997      FALSE        FALSE
## 151   0.3729025      FALSE        FALSE
## 176   0.1565611      FALSE        FALSE
## 201   0.2814462      FALSE         TRUE
## 226   0.4251593      FALSE         TRUE
## 251   0.3588022      FALSE        FALSE
## 276   0.3203996      FALSE        FALSE
## 301   0.4137666      FALSE        FALSE
## 326   0.5405788       TRUE         TRUE
## 351   0.2235620      FALSE        FALSE
## 376   0.4400777      FALSE         TRUE

We can see that our model doesn’t seem to be doing a great job predicting outcomes. But let’s calculate the exact degree of its fallibility. We’ll first create a column reporting whether the prediction failed:

alcohol.practice$prediction.failed <- alcohol.practice$prediction!=alcohol.practice$real_outcome
alcohol.practice[seq(1,nrow(alcohol.practice),25),5:8]
##     probability prediction real_outcome prediction.failed
## 1     0.1559632      FALSE        FALSE             FALSE
## 26    0.3073117      FALSE         TRUE              TRUE
## 51    0.3626220      FALSE        FALSE             FALSE
## 76    0.4251593      FALSE        FALSE             FALSE
## 101   0.1533398      FALSE        FALSE             FALSE
## 126   0.1232997      FALSE        FALSE             FALSE
## 151   0.3729025      FALSE        FALSE             FALSE
## 176   0.1565611      FALSE        FALSE             FALSE
## 201   0.2814462      FALSE         TRUE              TRUE
## 226   0.4251593      FALSE         TRUE              TRUE
## 251   0.3588022      FALSE        FALSE             FALSE
## 276   0.3203996      FALSE        FALSE             FALSE
## 301   0.4137666      FALSE        FALSE             FALSE
## 326   0.5405788       TRUE         TRUE             FALSE
## 351   0.2235620      FALSE        FALSE             FALSE
## 376   0.4400777      FALSE         TRUE              TRUE

Crucially, R is so clever that it treats the logical values TRUE and FALSE numerically as 1 and 0, respectively. Thus calculating the error rate is easy as pie – it is simply the mean of the column that reports TRUE for every student for whom the prediction failed.

(ErrorRate <- mean(alcohol.practice$prediction.failed))
## [1] 0.2879581

The model predicts correctly about 71% of the time.
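For a fuller picture than the single error rate, a confusion matrix separates false positives from false negatives. Here is a minimal sketch on toy vectors (not the actual alcohol data):

```r
# table() cross-tabulates predictions against observed outcomes,
# separating false positives from false negatives:
pred <- c(FALSE, FALSE, TRUE, TRUE, FALSE)   # hypothetical predictions
real <- c(FALSE, TRUE, TRUE, FALSE, FALSE)   # hypothetical observed outcomes
table(Predicted = pred, Actual = real)
mean(pred != real)  # the error rate, calculated as above
## [1] 0.4
```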

4. Model Predictions vs. Educated Guessing

Now we can compare the model’s predictive performance to a strategy of simple but educated guessing. First we need to know the overall rate of high alcohol use in the entire dataset, i.e., its empirical probability.

(EmpiricalProbability <- mean(alcohol$high_use))
## [1] 0.2984293

So it’s slightly below 30%. An educated guesser would take this empirical probability and toss a commensurately biased coin for each student in the dataset, resulting in a sequence of predictions like this:

(alcohol.practice$EducatedGuesses <- sample(c(TRUE, FALSE), size = nrow(alcohol), replace = TRUE, prob = c(EmpiricalProbability,1-EmpiricalProbability)))
##   [1] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
##  [12] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE
##  [23] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE
##  [34] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE
##  [45]  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
##  [56] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE
##  [67] FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
##  [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
##  [89] FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE
## [100] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
## [111] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE
## [122] FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## [133] FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE
## [144] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
## [155] FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## [166] FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
## [177]  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE
## [188] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE
## [199] FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE
## [210]  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE
## [221] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE
## [232] FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE
## [243]  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [254] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
## [265]  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE
## [276]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
## [287] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE
## [298]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
## [309] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE
## [320] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE
## [331]  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## [342]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
## [353]  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE
## [364]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
## [375] FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE
alcohol.practice$EducatedGuessFailed <- alcohol.practice$real_outcome!=alcohol.practice$EducatedGuesses

Now then, we will simply cross-tabulate the outcome distributions of these two prediction strategies.

ModelPredictionFail <- table(alcohol.practice$prediction.failed) 
EducatedGuessFail<- table(alcohol.practice$EducatedGuessFailed)
(CrossTab <- rbind(ModelPredictionFail, EducatedGuessFail))
##                     FALSE TRUE
## ModelPredictionFail   272  110
## EducatedGuessFail     207  175

The chi-square test is perfect for these situations. It measures the total, aggregate difference between the frequency distributions of two (or more) samples, and reports the probability that an equal or greater difference would occur if both samples were from the same population. This probability is the p-value. In our case, the “populations” are the hypothetical populations of successful and failed predictions generated by the two guessing strategies. If the prediction strategies are equally good, then it is unlikely that their observed success rates in our dataset of 382 students would differ enough to result in a p-value below .05. Let’s find out:

chisq.test(CrossTab)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  CrossTab
## X-squared = 22.923, df = 1, p-value = 1.686e-06

My output reports a p-value far below even the strictest conventional significance level of .001, so it seems safe to conclude that the model soundly outperforms the educated guessing strategy. But bear in mind that the educated guesses are re-generated from the empirical probability every time this page is knitted, so your results will not be identical to mine! Should you somehow get lucky and chance upon an instance of the educated guessing strategy outperforming the model (or even being competitive enough to yield a statistically non-significant p-value), feel free to take a screenshot and email it to me at juho decimalpoint ruohonen at helsinki decimalpoint eff eye.
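For what it’s worth, the same comparison can also be run as a two-sample test of proportions with base R’s prop.test(), which on a 2x2 table computes the same Yates-corrected chi-square statistic. Using the failure counts from my run (yours will differ, for the reason just mentioned):

```r
# Failures out of 382 predictions each: 110 (model) vs. 175 (educated guessing).
# On a 2x2 table this reproduces the Yates-corrected chi-square test above.
prop.test(x = c(110, 175), n = c(382, 382))
```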

5. Cross-Validating the Model

We’ll now go back to the full dataset with all the predictors intact – we will need them shortly. We’ll add a new column containing the probabilities calculated by our model, and specify an error rate calculation function ‘loss.func’ which will be needed in cross-validation.

alcohol$probability <- predict(model, alcohol, type="response")
loss.func <- function(class, prob){
  n_wrong <- abs(class - prob) > 0.5  # TRUE when the prediction misses
  mean(n_wrong)
}
loss.func(alcohol$high_use,alcohol$probability)
## [1] 0.2879581

Then we’ll perform a 10-fold cross-validation of the model. This means that we divide the dataset into 10 random, complementary subsets of equal size, then proceed through 10 cycles:

  1. Recalculate the model coefficients based on sub-datasets 1 through 9, then test the predictive performance of the model on sub-dataset 10 and store the error rate in memory.
  2. Recalculate the model coefficients based on sub-datasets 2 through 10, then test the predictive performance of the model on sub-dataset 1 and store the error rate in memory.
  3. Recalculate the model coefficients based on sub-datasets 3 through 10 and 1, then test on sub-dataset 2, and so on, until every sub-dataset has been used for both model creation and model testing exactly once.
  4. Finally, calculate the mean error rate from the individual error rates of the cycles.
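The looping described above can be sketched by hand on simulated data (the data frame and variable names below are made up for illustration; this is not the alcohol data):

```r
set.seed(123)
n <- 200
toy <- data.frame(x = rnorm(n))
toy$y <- rbinom(n, 1, plogis(-0.5 + 1.2 * toy$x))  # simulated binary outcome

K <- 10
folds <- sample(rep(1:K, length.out = n))  # random fold assignment
fold_errors <- numeric(K)
for (k in 1:K) {
  train <- toy[folds != k, ]                        # fit on 9 folds...
  test  <- toy[folds == k, ]                        # ...test on the 10th
  fit   <- glm(y ~ x, data = train, family = "binomial")
  p     <- predict(fit, test, type = "response")
  fold_errors[k] <- mean((p > 0.5) != test$y)       # fold-specific error rate
}
mean(fold_errors)  # the cross-validated error estimate
```

Each fold’s model never sees its own test observations, which is what makes the resulting error estimate honest.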

All this can be done with a single function call if we have the boot library installed.

library(boot)
CrossVal <- cv.glm(data = alcohol, glmfit=model, cost = loss.func, K = 10)
CrossVal$delta[[1]]
## [1] 0.2984293

Now the error rate is of course higher than when we were simply testing the model’s predictive power on the exact same dataset on which it was built.

This model performs worse than the one based on failures, absences and sex that we used in the Datacamp exercises. However, I chose these variables for autodidactic reasons (to teach myself to work with both continuous and categorical ones) rather than to maximize prediction performance. I predict, however, that it shall be easy to come up with a 4-variable model with an error rate below 25%. Actually, that goal is too modest. Let’s shoot for a 3-variable model with an error rate lower than 25%.

To make the model-building and experimentation quicker and more convenient, we’ll write a function that minimizes the typing and manual calculation involved:

logregr <- function(dataset = alcohol, model.formula){
  tmp.df <- dataset; attach(tmp.df,warn.conflicts = FALSE)
  logitmodel <- glm(data = tmp.df, formula = model.formula, family = "binomial")
  print(summary(logitmodel))
  cat("Logodds coefficients and their confidence intervals:\n"); print(cbind(data.frame(Estimate=coef(logitmodel)), confint(logitmodel)))
  cat("\nError rate in 10-fold cross-validation:\n", cv.glm(data = tmp.df, glmfit = logitmodel, cost = loss.func, K = 10)[[3]][1])
  detach(tmp.df)
}

Now we only need to specify the dataset and the formula stating the response variable and the predictors. The function will automatically print out the model summary, a table of the logodds-coefficients with their confidence intervals (which I find more intuitive than the odds ratios), and finally the error rate obtained in a cycle of 10-fold cross-validation.
So, let’s try to find a model that uses only 3 predictors and achieves an error rate lower than 25%.

logregr(alcohol, high_use ~ sex+famrel+studytime.factor)
## 
## Call:
## glm(formula = model.formula, family = "binomial", data = tmp.df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5113  -0.8705  -0.6293   1.2387   2.1741  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)   
## (Intercept)             0.3277     0.5375   0.610  0.54203   
## sexM                    0.7458     0.2500   2.984  0.00285 **
## famrel                 -0.3160     0.1267  -2.495  0.01260 * 
## studytime.factor2-5h   -0.2685     0.2688  -0.999  0.31798   
## studytime.factor5-10h  -1.0123     0.4340  -2.332  0.01969 * 
## studytime.factor>10h   -1.2909     0.5993  -2.154  0.03124 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 434.07  on 376  degrees of freedom
## AIC: 446.07
## 
## Number of Fisher Scoring iterations: 4
## 
## Logodds coefficients and their confidence intervals:
## Waiting for profiling to be done...
##                         Estimate      2.5 %      97.5 %
## (Intercept)            0.3277477 -0.7299430  1.38518832
## sexM                   0.7458322  0.2594951  1.24132207
## famrel                -0.3160191 -0.5665000 -0.06815506
## studytime.factor2-5h  -0.2684703 -0.7942086  0.26156699
## studytime.factor5-10h -1.0122745 -1.9090880 -0.19217144
## studytime.factor>10h  -1.2908808 -2.6060013 -0.20350510
## 
## Error rate in 10-fold cross-validation:
##  0.3062827
logregr(alcohol, high_use ~ sex+famrel+Pstatus)
## 
## Call:
## glm(formula = model.formula, family = "binomial", data = tmp.df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4113  -0.8667  -0.6679   1.2311   1.9401  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.02104    0.58866   0.036  0.97149    
## sexM         0.93116    0.23452   3.971 7.17e-05 ***
## famrel      -0.33016    0.12446  -2.653  0.00799 ** 
## PstatusT    -0.08714    0.38065  -0.229  0.81894    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 443.77  on 378  degrees of freedom
## AIC: 451.77
## 
## Number of Fisher Scoring iterations: 4
## 
## Logodds coefficients and their confidence intervals:
## Waiting for profiling to be done...
##                Estimate      2.5 %      97.5 %
## (Intercept)  0.02103570 -1.1487397  1.17072656
## sexM         0.93115696  0.4765509  1.39750348
## famrel      -0.33015614 -0.5765789 -0.08690103
## PstatusT    -0.08713609 -0.8145044  0.68972175
## 
## Error rate in 10-fold cross-validation:
##  0.2879581

Sigh. I wanted to find the better model without resorting too much to the silly variables used in the Datacamp example – that model predicts high alcohol use from school absenteeism and failures. Common sense suggests that high use of alcohol is a cause, not a result, of these phenomena. But in order to move on, I’ll just use sex and absences as the first two variables, and try to find a third one that improves model performance relative to failures.

logregr(alcohol, high_use ~ sex+absences+goout)
## 
## Call:
## glm(formula = model.formula, family = "binomial", data = tmp.df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7897  -0.8155  -0.5441   0.8346   2.4743  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.16640    0.47692  -8.736  < 2e-16 ***
## sexM         0.96586    0.25451   3.795 0.000148 ***
## absences     0.08495    0.02239   3.794 0.000148 ***
## goout        0.72847    0.11992   6.074 1.24e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 388.10  on 378  degrees of freedom
## AIC: 396.1
## 
## Number of Fisher Scoring iterations: 4
## 
## Logodds coefficients and their confidence intervals:
## Waiting for profiling to be done...
##               Estimate       2.5 %     97.5 %
## (Intercept) -4.1664029 -5.14237353 -3.2687073
## sexM         0.9658600  0.47308974  1.4730023
## absences     0.0849536  0.04234994  0.1312952
## goout        0.7284671  0.49910441  0.9703140
## 
## Error rate in 10-fold cross-validation:
##  0.2198953

We have a winner! Maleness, school absenteeism, and going out with friends are all significant correlates – though not necessarily causes – of high alcohol use in 15-to-22-year-old Portuguese students.

6. Manual Step-Down Regression

Though computers perform immensely complex calculations for us, they cannot interpret the results or draw conclusions. Interpretation is always the task of the researcher, who must rely on his/her knowledge of the subject matter to “connect the dots” emerging from calculations. In a similar vein, even a high coefficient obtained by a model is in itself no evidence whatsoever of a causal link between variables x and y.
In all variants of regression analysis, the computer is instructed to “explain” the variance of the y-variable as a function of the x-variable(s) and a constant, and it will make its best effort to do just that – regardless of how absurd that “explanation” may be. A prime example of this is the association between ice-cream consumption and drownings. Correlation is not causation. Therefore, we need to really know what we’re doing if we’re going to dump truckloads of predictors into the model – the more we dump in, the better the computer will be able to “explain” the variation. But unless we can plausibly explicate HOW a given x-variable with a significant p-value causes the change in the y-variable, we cannot claim a causal link.

Therefore, aside from prediction power, model economy is another key goal of regression analysis. We aim to explain as much of the variation of Y as possible with as few x-variables as possible, because every new x-variable we add to the model increases the amount of explaining we have to do – we want our explanation of a given phenomenon to be as succinct as possible.

The AIC score provides a measure of this. It is a function of residual deviance and model complexity, penalizing both poor fit and extra parameters. The lower the AIC score, the more accurate and/or economical the model.
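Concretely, for a logistic model fitted with glm() on 0/1 outcomes, the AIC equals the residual deviance plus twice the number of estimated parameters, so an added predictor must cut the deviance by more than 2 to pay for itself. A quick sanity check on R’s built-in mtcars data:

```r
# For 0/1 outcomes: AIC = residual deviance + 2 * (number of parameters)
fit <- glm(am ~ wt, data = mtcars, family = "binomial")
c(reported = AIC(fit),
  by_hand  = deviance(fit) + 2 * length(coef(fit)))
```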

With these guiding principles in mind, we will now practise manual step-down regression (also known as backwards elimination). We start with a bloated model featuring 10 predictors, and we will try to iteratively trim it down to just the most essential ones.
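As an aside, base R automates exactly this kind of AIC-driven backward elimination with step(); a sketch on simulated data is below (the variables d, x1, x2, x3 are made up for illustration). Doing the elimination by hand, as we are about to, makes the reasoning at each step explicit:

```r
set.seed(42)
d <- data.frame(x1 = rnorm(150), x2 = rnorm(150), x3 = rnorm(150))
d$y <- rbinom(150, 1, plogis(0.9 * d$x1))   # only x1 truly matters here
full <- glm(y ~ x1 + x2 + x3, data = d, family = "binomial")
step(full, direction = "backward", trace = 0)   # AIC-based backward elimination
```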

In order to make the printouts more succinct and less cluttered, I’ll use the original (numeric) version of studytime despite the questionability of its continuous nature. I will also leave absences out of the equation despite its strong correlation with the response variable: I remain unconvinced that it causes high alcohol use rather than being caused by high alcohol use. The following, then, are the 10 predictors that we begin with:

  1. Sex (binary)
  2. Frequency of going out (continuous)
  3. Number of past class failures (continuous)
  4. Internet access at home (binary)
  5. Parents’ cohabitation status (together or separated)
  6. Studytime per week – now treated as continuous
  7. Address (urban or rural, binary)
  8. Average grade in Math and Portuguese (continuous)
  9. Reason for selecting the current school (categorical, 4 levels)
  10. Having extra-curricular activities (binary)

Before beginning, we will finetune our logregr() function to suit the task better. Rather than 10-fold cross-validation, we’ll tell it to use the more exhaustive and reliable “leave-one-out” technique, which divides the data into as many partitions as there are observations. This means that R will rebuild the model on 381 of the total 382 observations and then try to predict the outcome of the one that was left out – 382 times – and calculate the mean error rate. The procedure is cpu-intensive – don’t try it on an 8086! The leave-one-out method is specified simply by omitting the K argument of the cv.glm call. Secondly, we’ll tell logregr() to skip the printing of confidence intervals – the model summary should suffice (note how easy it is to “deactivate” lines of code while still keeping them available for later use, simply by prepending # to them). Lastly, we’ll have the function output both the training error and the testing error in tabular format, so that we can easily document our model-trimming process.

logregr <- function(dataset = alcohol, model.formula){
  tmp.df <- dataset; attach(tmp.df,warn.conflicts = FALSE)
  logitmodel <- glm(data = tmp.df, formula = model.formula, family = "binomial")
  print(summary(logitmodel))
  #cat("Logodds coefficients and their confidence intervals:\n"); print(cbind(data.frame(Estimate=coef(logitmodel)), confint(logitmodel)))
  tmp.df$probability <- predict(logitmodel, tmp.df, type = "response")
  ErrorRates <- data.frame(TrainingError = loss.func(tmp.df$high_use, tmp.df$probability), TestingError = cv.glm(data = tmp.df, glmfit = logitmodel, cost = loss.func)[[3]][1])
  detach(tmp.df); cat("\nError rates:\n"); print.data.frame(ErrorRates); return(ErrorRates)
}

Alright, let’s start culling our oversized flock of variables.

Round1 <- logregr(alcohol, high_use ~ sex+goout+failures+internet+Pstatus+studytime+address+G3+reason+activities)
## 
## Call:
## glm(formula = model.formula, family = "binomial", data = tmp.df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9951  -0.7682  -0.4803   0.7653   2.5695  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -2.590348   0.924000  -2.803  0.00506 ** 
## sexM              0.683332   0.273356   2.500  0.01243 *  
## goout             0.783408   0.124819   6.276 3.47e-10 ***
## failures          0.306092   0.262269   1.167  0.24317    
## internetyes       0.334722   0.391270   0.855  0.39229    
## PstatusT         -0.193650   0.430505  -0.450  0.65284    
## studytime        -0.374028   0.178834  -2.091  0.03648 *  
## addressU         -0.874049   0.329809  -2.650  0.00805 ** 
## G3                0.001252   0.044276   0.028  0.97744    
## reasonhome        0.477262   0.317827   1.502  0.13319    
## reasonother       0.970762   0.449768   2.158  0.03090 *  
## reasonreputation -0.049525   0.353233  -0.140  0.88850    
## activitiesyes    -0.396111   0.267416  -1.481  0.13854    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 376.70  on 369  degrees of freedom
## AIC: 402.7
## 
## Number of Fisher Scoring iterations: 4
## 
## 
## Error rates:
##   TrainingError TestingError
## 1     0.2172775    0.2356021

Grades in Math and Portuguese are clearly irrelevant. Out they go.

Round2 <- logregr(alcohol, high_use ~ sex+goout+failures+internet+Pstatus+studytime+address+reason+activities)
## 
## Call:
## glm(formula = model.formula, family = "binomial", data = tmp.df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9976  -0.7691  -0.4802   0.7640   2.5717  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -2.57628    0.77844  -3.310 0.000934 ***
## sexM              0.68366    0.27308   2.504 0.012295 *  
## goout             0.78296    0.12381   6.324 2.55e-10 ***
## failures          0.30335    0.24370   1.245 0.213222    
## internetyes       0.33537    0.39061   0.859 0.390572    
## PstatusT         -0.19497    0.42790  -0.456 0.648644    
## studytime        -0.37348    0.17775  -2.101 0.035629 *  
## addressU         -0.87235    0.32426  -2.690 0.007139 ** 
## reasonhome        0.47726    0.31782   1.502 0.133180    
## reasonother       0.97080    0.44985   2.158 0.030925 *  
## reasonreputation -0.04928    0.35311  -0.140 0.889016    
## activitiesyes    -0.39558    0.26674  -1.483 0.138076    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 376.70  on 370  degrees of freedom
## AIC: 400.7
## 
## Number of Fisher Scoring iterations: 4
## 
## 
## Error rates:
##   TrainingError TestingError
## 1     0.2198953    0.2329843

AIC score went down a bit. Good. I find myself eyeballing the z-value column the most. Its absolute value is often a direct measure of the significance of the variable, and unlike the p-value, significance grows as the distance from zero grows. One important thing to note, however, is the different treatment of continuous and categorical variables. With multi-class categorical variables, every level except the first (the reference level) is listed as a “variable” of its own, with its own coefficients and other statistics. This means that we cannot read the significance of reason directly off this printout. Instead, we must consider the range of z-values exhibited by all the levels together. In this example, reasonreputation has a z-value of -0.140 while reasonother has 2.158. The z-value range of this variable, then, is about 2.3, indicating that this is quite a significant variable indeed.
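A more principled way to gauge the significance of a multi-level factor as a whole is a likelihood-ratio test, which base R’s drop1() performs for every term in the model; here is a sketch on simulated data (not the alcohol model; g and y are made-up variables):

```r
set.seed(7)
g <- factor(sample(c("a", "b", "c"), 200, replace = TRUE))
y <- rbinom(200, 1, plogis(-0.3 + 0.8 * (g == "b")))
fit <- glm(y ~ g, family = "binomial")
drop1(fit, test = "Chisq")   # one likelihood-ratio p-value for g as a whole
```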
The continuous variables are easier to interpret: the coefficient represents the effect of a one-unit change in the variable’s value, and the z-value represents the overall significance of the variable when all of its variation is taken into account.
The next variable to go is Pstatus. I’m a little surprised that the cohabitation status of the parents shows so little effect. I assumed that children of intact nuclear families would drink considerably less than those of single parents. Guess not! Next round:

Round3 <- logregr(alcohol, high_use ~ sex+goout+failures+internet+studytime+address+reason+activities)
## 
## Call:
## glm(formula = model.formula, family = "binomial", data = tmp.df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0061  -0.7694  -0.4835   0.7610   2.5605  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -2.7428     0.6894  -3.978 6.94e-05 ***
## sexM               0.6813     0.2731   2.495  0.01260 *  
## goout              0.7829     0.1238   6.323 2.57e-10 ***
## failures           0.2981     0.2435   1.224  0.22084    
## internetyes        0.3265     0.3901   0.837  0.40265    
## studytime         -0.3756     0.1775  -2.115  0.03440 *  
## addressU          -0.8652     0.3240  -2.670  0.00758 ** 
## reasonhome         0.4826     0.3176   1.520  0.12860    
## reasonother        0.9725     0.4490   2.166  0.03031 *  
## reasonreputation  -0.0427     0.3529  -0.121  0.90371    
## activitiesyes     -0.4036     0.2661  -1.517  0.12939    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 376.91  on 371  degrees of freedom
## AIC: 398.91
## 
## Number of Fisher Scoring iterations: 4
## 
## 
## Error rates:
##   TrainingError TestingError
## 1     0.2172775    0.2329843

AIC improved again.
I suppose internet is the next one to go. Its effect is totally contrary to my presumption. I assumed that people with no internet would drink more because they have fewer other entertainment options available. Not so. Live and learn…

Round4 <- logregr(alcohol, high_use ~ sex+goout+failures+studytime+address+reason+activities)
## 
## Call:
## glm(formula = model.formula, family = "binomial", data = tmp.df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9874  -0.7710  -0.4786   0.7802   2.4739  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -2.56314    0.65082  -3.938 8.20e-05 ***
## sexM              0.68918    0.27284   2.526   0.0115 *  
## goout             0.78937    0.12362   6.385 1.71e-10 ***
## failures          0.28263    0.24232   1.166   0.2435    
## studytime        -0.37424    0.17751  -2.108   0.0350 *  
## addressU         -0.80010    0.31429  -2.546   0.0109 *  
## reasonhome        0.50294    0.31649   1.589   0.1120    
## reasonother       0.98488    0.44984   2.189   0.0286 *  
## reasonreputation -0.01745    0.35117  -0.050   0.9604    
## activitiesyes    -0.38086    0.26462  -1.439   0.1501    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 377.63  on 372  degrees of freedom
## AIC: 397.63
## 
## Number of Fisher Scoring iterations: 4
## 
## 
## Error rates:
##   TrainingError TestingError
## 1     0.2198953    0.2251309

AIC keeps modestly improving.
I think failures goes next. Not gonna miss it. It was another of those whose very inclusion as a predictor seemed slightly dubious due to uncertainty about the directionality of the link, i.e., which really causes which. Begone, foul variable!

Round5 <- logregr(alcohol, high_use ~ sex+goout+studytime+address+reason+activities)
## 
## Call:
## glm(formula = model.formula, family = "binomial", data = tmp.df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8740  -0.7905  -0.4804   0.7770   2.5924  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -2.45141    0.64179  -3.820 0.000134 ***
## sexM              0.70513    0.27185   2.594 0.009493 ** 
## goout             0.80654    0.12321   6.546 5.91e-11 ***
## studytime        -0.41612    0.17424  -2.388 0.016928 *  
## addressU         -0.82719    0.31279  -2.645 0.008179 ** 
## reasonhome        0.51028    0.31477   1.621 0.104994    
## reasonother       0.95748    0.44811   2.137 0.032621 *  
## reasonreputation -0.03361    0.35053  -0.096 0.923606    
## activitiesyes    -0.39805    0.26353  -1.510 0.130924    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 379.00  on 373  degrees of freedom
## AIC: 397
## 
## Number of Fisher Scoring iterations: 4
## 
## 
## Error rates:
##   TrainingError TestingError
## 1     0.2198953    0.2329843

Now things are getting harder. The p-value threshold is just a rough guideline that R. A. Fisher popularized in the 1920s, back when it was hard work to perform all these calculations manually over and over again. It is not an infallible litmus test of significance. All of the remaining variables seem at least somewhat relevant to me. It is therefore with a heavy heart that I remove activities from the list of predictors – I only do so to comply with the exercise instructions.

Round6 <- logregr(alcohol, high_use ~ sex+goout+studytime+address+reason)
## 
## Call:
## glm(formula = model.formula, family = "binomial", data = tmp.df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9306  -0.7778  -0.4823   0.8296   2.5266  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -2.4862     0.6390  -3.891 9.98e-05 ***
## sexM               0.6266     0.2656   2.359  0.01830 *  
## goout              0.7897     0.1217   6.491 8.55e-11 ***
## studytime         -0.4513     0.1734  -2.603  0.00924 ** 
## addressU          -0.8104     0.3120  -2.598  0.00939 ** 
## reasonhome         0.5091     0.3125   1.629  0.10334    
## reasonother        0.9321     0.4493   2.075  0.03802 *  
## reasonreputation  -0.0993     0.3474  -0.286  0.77500    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 381.30  on 374  degrees of freedom
## AIC: 397.3
## 
## Number of Fisher Scoring iterations: 4
## 
## 
## Error rates:
##   TrainingError TestingError
## 1     0.2198953    0.2303665

We are essentially finished now. At this stage, it is no longer clear whether the current model is better than the previous one that included activities. The AIC score actually got worse after removing it, supporting my impression that it was no longer wise to keep eliminating predictors. On the other hand, Testing Error improved slightly. Frankly, I’m having a hard time wrapping my head around these numbers: both AIC and Residual Deviance increased, and both those statistics are supposedly derived (at least in part) from how well the model fits the data. Then how can prediction accuracy improve at the same time as those two measures become worse?
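At least the AIC side of the puzzle has a simple arithmetic explanation: for a glm, AIC equals the residual deviance plus twice the number of estimated parameters. Dropping activities raised the deviance by about 2.3 but removed one parameter, so AIC rose by only about 0.3. Checking against the figures reported above:

```r
# AIC = residual deviance + 2 * (number of estimated parameters)
aic.round5 <- 379.00 + 2 * 9   # Round 5: intercept + 8 coefficients
aic.round6 <- 381.30 + 2 * 8   # Round 6: one parameter fewer (activities dropped)
c(aic.round5, aic.round6)      # 397.0 and 397.3, as reported
```

Testing Error, on the other hand, is computed on held-out data, so it measures something different from in-sample fit and need not move in step with deviance or AIC.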
In any case, all the remaining predictors are significant from this point on – according to both statistical math and plain common sense. If we remove more, we are bound to witness an upswing in both error rate and AIC score:

Round7 <- logregr(alcohol, high_use ~ sex+goout+studytime+address)
## 
## Call:
## glm(formula = model.formula, family = "binomial", data = tmp.df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7689  -0.8106  -0.4822   0.7589   2.4343  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.1931     0.6025  -3.640 0.000273 ***
## sexM          0.6525     0.2617   2.493 0.012652 *  
## goout         0.7727     0.1203   6.421 1.36e-10 ***
## studytime    -0.4965     0.1691  -2.936 0.003329 ** 
## addressU     -0.7287     0.3009  -2.421 0.015467 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 388.57  on 377  degrees of freedom
## AIC: 398.57
## 
## Number of Fisher Scoring iterations: 4
## 
## 
## Error rates:
##   TrainingError TestingError
## 1     0.2277487    0.2277487

Say what? AIC and Residual Deviance deteriorated further, and even Training Error went up – all of which was to be expected from removing a clearly significant predictor. Yet Testing Error improved again? I am stumped. Basic common sense would suggest, however, that the more accurately a model predicts outcomes and the simpler it is, the better it is, no matter what any other fancy mathematical calculations might say. Therefore, we will keep playing this game for as long as Testing Error keeps improving. The next variable to go is address.
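One plausible explanation for the erratic Testing Error is that it comes from a single random train/test split, so it carries considerable sampling noise. Averaging over many splits with k-fold cross-validation gives a steadier estimate. Here is a minimal sketch using boot::cv.glm() on simulated data; the names are made up, and with the real data one would pass the alcohol data frame and, say, the Round 7 formula instead:

```r
library(boot)  # ships with R

set.seed(1)
sim <- data.frame(x = rnorm(200))
sim$y <- rbinom(200, 1, plogis(0.8 * sim$x))   # toy binary outcome

fit <- glm(y ~ x, family = "binomial", data = sim)

# Misclassification cost: 1 if the predicted probability lands on the wrong side of 0.5
cost <- function(r, pi) mean(abs(r - pi) > 0.5)

# 10-fold cross-validated error rate, averaged over the held-out folds
cv.err <- cv.glm(sim, fit, cost = cost, K = 10)$delta[1]
cv.err
```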

Round8 <- logregr(alcohol, high_use ~ sex+goout+studytime)
## 
## Call:
## glm(formula = model.formula, family = "binomial", data = tmp.df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7173  -0.8038  -0.5212   0.8877   2.6290  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.7121     0.5691  -4.766 1.88e-06 ***
## sexM          0.6725     0.2585   2.602  0.00927 ** 
## goout         0.7482     0.1188   6.298 3.01e-10 ***
## studytime    -0.4866     0.1679  -2.898  0.00375 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 394.38  on 378  degrees of freedom
## AIC: 402.38
## 
## Number of Fisher Scoring iterations: 4
## 
## 
## Error rates:
##   TrainingError TestingError
## 1     0.2460733    0.2460733

WHOA stop right there! Enough! Now EVERYTHING got worse. That was clearly one step too far. We are finished. Let’s build a data frame reporting all the training and testing errors that we observed in each phase of this process, and then draw a chart of their progression:

(Errors<-cbind(PredictorsRemaining=10:3,rbind(Round1,Round2,Round3,Round4,Round5,Round6,Round7,Round8)))
##   PredictorsRemaining TrainingError TestingError
## 1                  10     0.2172775    0.2356021
## 2                   9     0.2198953    0.2329843
## 3                   8     0.2172775    0.2329843
## 4                   7     0.2198953    0.2251309
## 5                   6     0.2198953    0.2329843
## 6                   5     0.2198953    0.2303665
## 7                   4     0.2277487    0.2277487
## 8                   3     0.2460733    0.2460733
ggplot(data=Errors, aes(x=PredictorsRemaining)) +
  geom_line(aes(y=TrainingError, x=PredictorsRemaining-0.017, color="TrainingError")) +
  geom_line(aes(y=TestingError, color="TestingError", x=PredictorsRemaining+0.017)) +
  scale_x_reverse(breaks=10:3) +
  theme(panel.grid.minor=element_blank()) +
  scale_y_continuous(limits = c(.20,.26), breaks = c(.20,.21,.22,.23,.24,.25,.26)) +
  geom_point(aes(y=TestingError, x=PredictorsRemaining+0.017)) +
  geom_point(aes(y=TrainingError, x=PredictorsRemaining-0.017)) +
  ylab("Error Rate") +
  xlab("Number of Predictors in Model") +
  ggtitle("Error Rate Progression in Backward Elimination of Predictors", subtitle="N = 382")

Going by the simple and intuitive criteria of model economy and prediction power, there is no question about the winner. The model in Round 7, containing the predictors goout, sex, address and studytime, achieves by far the best combination of simplicity and accuracy. This partly contradicts the AIC and Residual Deviance test statistics, both of which implied the superiority of Model 5 (which also included reason and activities). But I will go with common sense here, awarding the win to Model 7. Here’s the winning function call:

logregr(alcohol, high_use ~ sex+goout+studytime+address)

This relatively laborious backwards-elimination process that we’ve just completed can be done automatically by many software tools. One of them is Rbrul(), a popular R program with an interactive interface that supports complex methods such as generalized mixed modeling. However, it is risky to rely on automated algorithms to identify significant variables. Computers can calculate correlations but they can’t detect causal relationships. They can’t decide which variables are meaningful to start the analysis with. A computer couldn’t have made my choice – based on real-world knowledge – to leave absences out of the initial set of variables due to its dubious causal status. At the time of this writing (2017), computers are still idiot savants – great at calculus, but ultimately dumb.
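Base R can also automate the elimination itself: step() performs backward elimination using AIC as its criterion. A hedged sketch on simulated data; with the real data one would call step() on the full Round 1 model object instead:

```r
set.seed(2)
d <- data.frame(x1 = rnorm(150), x2 = rnorm(150), noise = rnorm(150))
d$y <- rbinom(150, 1, plogis(1.2 * d$x1 - 0.9 * d$x2))   # noise is unrelated to y

full <- glm(y ~ x1 + x2 + noise, family = "binomial", data = d)

# Drop predictors one at a time, keeping each removal only if it lowers AIC
reduced <- step(full, direction = "backward", trace = 0)
formula(reduced)
```

Of course, this automates only the AIC bookkeeping – the causal judgment discussed above still has to come from the analyst.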


Clusters and Classification

This is the Clustering and Classification homework. We’ll start by taking a gander at the data.

library(MASS)
data("Boston")
str(Boston);View(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

These variable names tell us little about what they represent. I’ll just copy-paste the descriptions from the original website:

crim: per capita crime rate by town.

zn: proportion of residential land zoned for lots over 25,000 sq.ft.

indus: proportion of non-retail business acres per town.

chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

nox: nitrogen oxides concentration (parts per 10 million).

rm: average number of rooms per dwelling.

age: proportion of owner-occupied units built prior to 1940.

dis: weighted mean of distances to five Boston employment centres.

rad: index of accessibility to radial highways.

tax: full-value property-tax rate per \$10,000.

ptratio: pupil-teacher ratio by town.

black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.

lstat: lower status of the population (percent).

medv: median value of owner-occupied homes in \$1000s.

Not even the host page tells us much about the purpose of this dataset. If I had to guess, I’d guess it was collected to analyze correlates and predictors of house prices. Here are the summaries:

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Linear Discriminant Analysis requires the explanatory variables to be numeric, which is doubtless why this dataset was chosen for this chapter – it contains only numeric variables. Despite the word linear in the name, LDA does not require the predictors to be continuous in the strict sense, only that they can be represented numerically; our Boston dataset even contains one binary variable, namely chas. That makes LDA quite a versatile statistical tool. What it does formally assume is that the predictors are normally distributed and share the same (co)variance across classes. However, according to this page, there are valid work-arounds even when these assumptions are not met.

We will now proceed to draw a plot of the correlations of each pair of variables:

library("corrplot");library("tidyverse")
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
## select(): dplyr, MASS
(matriisi<-cor(Boston)) %>% corrplot(method="circle")

Circle diameter and darkness illustrate the strength of the correlation between the two variables that intersect at that cell of the matrix. Blue means a positive correlation, red a negative one. The strongest correlations that catch my eye in this plot are the positive one between rad and tax, and the negative ones between dis and each of indus, nox, and age, as well as between lstat and medv.

Now it’s time to scale the data.

Boston.unscaled<-Boston #Backing up the unscaled version
Boston<-scale(Boston)
summary(Boston)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

Now every variable has been standardized: centered by subtracting the variable’s mean from each value, then dividing the difference by the variable’s standard deviation. This makes comparisons of variance much easier and more intuitive. Next we convert crim(e rate) into a categorical variable AKA factor. This time we also do what is seldom advisable, i.e., delete the old, numeric variable afterwards:

Boston<-as.data.frame(Boston)
(kvantiilit<-quantile(Boston$crim))
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
Boston$crime.factor<-cut(Boston$crim, kvantiilit, include.lowest = TRUE, labels=c("low","low-ish","high-ish","high"))
Boston<-Boston[,-1]

The scale() function turned Boston into a matrix object, which doesn’t behave quite the way we want here, so I had to convert it back into a data frame before being able to proceed. Don’t ask why scale() does that.
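For the record, scale() simply computes (x - mean(x)) / sd(x) column by column, which is easy to verify by hand on a small vector:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
by.hand   <- (x - mean(x)) / sd(x)
via.scale <- as.vector(scale(x))   # scale() returns a matrix, hence as.vector()
all.equal(by.hand, via.scale)      # TRUE
```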

Now we must randomly split the data into a training set consisting of 80% of the observations and a testing set comprising the remaining 20%:

set.seed(10)
TrainingRows<-sample(nrow(Boston),size=nrow(Boston)*.8) #Randomly select 80% of all rows
TrainingSet<-Boston[TrainingRows,] #Create training set
TestingSet<-Boston[-TrainingRows,] #Create testing set
head(TrainingSet)
##             zn       indus       chas         nox         rm        age
## 257  3.3717021 -1.07673449 -0.2723291 -1.38676461  1.6642999 -1.2211827
## 155 -0.4872402  1.23072696  3.6647712  2.72964520 -0.2215067  0.9742880
## 216 -0.4872402 -0.07970124 -0.2723291 -0.56693456 -0.1460744 -0.9298742
## 349  2.9429307 -1.33036576 -0.2723291 -1.03294322  0.4986579 -1.3810470
## 43  -0.4872402 -0.61611679 -0.2723291 -0.92075595 -0.1645767 -2.2016841
## 113 -0.4872402 -0.16424500 -0.2723291 -0.06640675 -0.5289287  0.8641592
##             dis        rad         tax     ptratio      black      lstat
## 257  1.20674602 -0.7521778 -0.97448656 -1.18041474  0.3249467 -1.3363648
## 155 -0.97147402 -0.5224844 -0.03107419 -1.73470120 -0.3905371  0.3454580
## 216  0.07140456 -0.6373311 -0.77868399  0.06672981  0.4047979 -0.4457409
## 349  2.16029607 -0.6373311 -0.76088376 -0.67231881  0.3753329 -0.9330634
## 43   0.91458805 -0.7521778 -1.03975408 -0.25660396  0.2924148 -0.9582698
## 113 -0.68463492 -0.4076377  0.14099473 -0.30279450  0.4192565  0.4980964
##           medv crime.factor
## 257  2.3341253          low
## 155 -0.6015814     high-ish
## 216  0.2682577      low-ish
## 349  0.2138927          low
## 43   0.3008766      low-ish
## 113 -0.4058676      low-ish
head(TestingSet)
##            zn      indus       chas        nox         rm        age
## 1   0.2845483 -1.2866362 -0.2723291 -0.1440749  0.4132629 -0.1198948
## 3  -0.4872402 -0.5927944 -0.2723291 -0.7395304  1.2814456 -0.2655490
## 4  -0.4872402 -1.3055857 -0.2723291 -0.8344581  1.0152978 -0.8090878
## 5  -0.4872402 -1.3055857 -0.2723291 -0.8344581  1.2273620 -0.5106743
## 14 -0.4872402 -0.4368257 -0.2723291 -0.1440749 -0.4776917 -0.2406812
## 16 -0.4872402 -0.4368257 -0.2723291 -0.1440749 -0.6413655 -0.4289659
##          dis        rad        tax    ptratio     black      lstat
## 1  0.1400750 -0.9818712 -0.6659492 -1.4575580 0.4406159 -1.0744990
## 3  0.5566090 -0.8670245 -0.9863534 -0.3027945 0.3960351 -1.2075324
## 4  1.0766711 -0.7521778 -1.1050216  0.1129203 0.4157514 -1.3601708
## 5  1.0766711 -0.7521778 -1.1050216  0.1129203 0.4406159 -1.0254866
## 14 0.4333252 -0.6373311 -0.6006817  1.1753027 0.4406159 -0.6151835
## 16 0.3341188 -0.6373311 -0.6006817  1.1753027 0.4265954 -0.5857761
##          medv crime.factor
## 1   0.1595278          low
## 3   1.3229375          low
## 4   1.1815886          low
## 5   1.4860323          low
## 14 -0.2318998     high-ish
## 16 -0.2862647     high-ish

Linear Discriminant Analysis

So now we will follow a procedure very similar to the Cross-Validation seen in the previous chapter. We will first build a model on the training dataset, then we will test its predictive power on the testing dataset. A new, useful detail is that if you want to use all the variables found in the data frame as predictors, you can use a single dot as a wildcard for that:

(lda.fit<-lda(crime.factor ~ ., data = TrainingSet))
## Call:
## lda(crime.factor ~ ., data = TrainingSet)
## 
## Prior probabilities of groups:
##       low   low-ish  high-ish      high 
## 0.2376238 0.2524752 0.2475248 0.2623762 
## 
## Group means:
##                  zn      indus        chas        nox          rm
## low       0.9594165 -0.8884546 -0.14929469 -0.8605362  0.40702138
## low-ish  -0.1166892 -0.3020359  0.03646311 -0.5618582 -0.09366532
## high-ish -0.3911954  0.2128639  0.20012296  0.3838094  0.08903229
## high     -0.4872402  1.0149946 -0.08661679  1.0257502 -0.39221614
##                 age        dis        rad        tax     ptratio
## low      -0.8694440  0.8295029 -0.6863802 -0.7605747 -0.46686714
## low-ish  -0.3840715  0.3607010 -0.5528850 -0.4794539 -0.07048326
## high-ish  0.4321347 -0.4042103 -0.3892622 -0.2854989 -0.29725164
## high      0.8311809 -0.8556179  1.6596029  1.5294129  0.80577843
##               black       lstat        medv
## low       0.3754402 -0.75841326  0.50644405
## low-ish   0.3219624 -0.16902036  0.02254946
## high-ish  0.1455522  0.06834233  0.18486185
## high     -0.8015896  0.90526853 -0.73923748
## 
## Coefficients of linear discriminants:
##                  LD1          LD2         LD3
## zn       0.134711446  0.815076287 -1.02611246
## indus    0.033748436 -0.336884258  0.15341827
## chas    -0.003030025 -0.059884611  0.11849430
## nox      0.267076396 -0.592218405 -1.25727830
## rm       0.055354251 -0.106946680 -0.07288027
## age      0.310296486 -0.303533896 -0.23368301
## dis     -0.148881410 -0.295690803  0.24054891
## rad      3.460334684  0.875004531 -0.16525987
## tax      0.009992405 -0.072742670  0.73620027
## ptratio  0.122422773  0.125987461 -0.20686483
## black   -0.143035765 -0.007637357  0.10294691
## lstat    0.133606922 -0.351329562  0.32275482
## medv    -0.002463537 -0.509212123 -0.21656505
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9558 0.0336 0.0106

Linear Discriminant Analysis is a pattern-recognition and dimensionality-reduction technique. It is used to identify patterns in the behavior of a set of continuous explanatory variables across 2 or more known categories/groups (the dependent variable), and to distill these different behaviors into a set of “discriminants” – specific combinations of values that the explanatory variables tend to take in each category/group.

For instance, imagine you’re an entrepreneur selling a single product that you’ve worked on for years. To improve the success rate of your marketing efforts, you might gather all the available information on the people you’ve ever solicited in your career and conduct an LDA to identify what the people who actually buy from you have in common. Say you’re able to access information on the following explanatory variables: age, income level, BMI, number of children, hair length, and size of current home town – that’s six explanatory variables. You could then use LDA to determine which of those six variables are actually significant in differentiating buyers from non-buyers and which aren’t. The significant ones have an LD coefficient with a high absolute value, while the coefficients of the non-significant differentiators are closer to zero. In our hypothetical scenario, for example, LDA might reveal that the people who buy from you tend to be over the age of 30, have more than 3 kids, live in very small towns, and have medium-to-low income. That is their “linear discriminant”, and you can multiply your sales by focusing your marketing on people who match that profile.
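As a concrete miniature of this idea, here is LDA on the classic iris dataset that ships with R: three species, four numeric measurements, and hence two linear discriminants whose coefficients show which measurements separate the species:

```r
library(MASS)

iris.lda <- lda(Species ~ ., data = iris)
iris.lda$scaling                          # LD coefficients: large absolute values separate strongly
iris.lda$svd^2 / sum(iris.lda$svd^2)      # proportion of between-group variance per discriminant
```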

In the exercise at hand, we’re trying to identify the discriminants of the 4 different levels of crime rate. I’m not particularly enamored with this categorization: I view this dependent variable as essentially continuous, and its conversion into a factor seems somewhat artificial to me. But the instructions are what they are, and obey we must. Here’s an arrow bi-plot of the LDA:

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
classes <- as.numeric(TrainingSet$crime.factor)
plot(lda.fit, dimen = 2, col=classes, pch=classes);lda.arrows(lda.fit, myscale = 1.5)

Frankly, I don’t get much out of this plot. It’s too messy and cluttered for me to fully wrap my head around.

Our next step is using the model for predictions. We’ll start by storing the actual crime categories of the testing set in a separate vector and then removing that column from the dataset, in order to make the prediction game more interesting:

correct.classes<-TestingSet$crime.factor
TestingSet<-TestingSet[,which(colnames(TestingSet)!="crime.factor")]

Then we make predictions in a familiar way…

lda.pred <- predict(lda.fit, newdata = TestingSet)

…followed by a cross-tabulation of the predicted and observed outcomes:

table(real = correct.classes, predicted = lda.pred$class)
##           predicted
## real       low low-ish high-ish high
##   low       19      11        1    0
##   low-ish    5      14        5    0
##   high-ish   1      11       13    1
##   high       0       0        1   20

This shows us that the model does well at predicting low and high crime rates – a useful model, in other words. The intermediate categories cause much more difficulty though.
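The overall accuracy can be read straight off this table, since the correct predictions sit on the diagonal. Recomputing with the counts above:

```r
conf <- matrix(c(19, 11,  1,  0,
                  5, 14,  5,  0,
                  1, 11, 13,  1,
                  0,  0,  1, 20),
               nrow = 4, byrow = TRUE,
               dimnames = list(real      = c("low", "low-ish", "high-ish", "high"),
                               predicted = c("low", "low-ish", "high-ish", "high")))
sum(diag(conf)) / sum(conf)   # overall accuracy: 66 / 102, about 0.65
```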

K-Means Cluster Analysis

Cluster analysis is used to identify meaningful categories within a population. A meaningful category is a subgroup of the population whose members, in one way or another, behave similarly to other members of that same group, and dissimilarly from the members of the other subgroups.

As humans, we have a built-in pattern recognition firmware that helps us place the real-world phenomena that we encounter into pre-determined categories and respond to them accordingly. Therefore, the utility of cluster analysis might not be quite as intuitively obvious as that of a technique such as regression analysis or LDA. An example of a situation where cluster analysis may be useful is if, say, we discover a new species of deep-sea fish. At the beginning, we will have very little knowledge of the scale of variation – morphological or otherwise – present in the species. We might even have stumbled upon an entire new taxon of fish comprising more than one separate species! We would gather measurements on a set of characteristics from as large a sample of the population as possible – say, weight, length, number of scales, tooth length, etc. – and then use cluster analysis to see how many meaningful subgroups the population seems to divide into. After doing so, we might summon a board of biologists to decide whether these subgroups represent a set of sub-species of a single new species, or a set of more than one new species. Then we would proceed to give them names, descriptions, and so on.

In the present exercise, we will use cluster analysis to see how many meaningful group distinctions emerge among the different Boston districts represented in our dataset. To do this, we need a version of the Boston dataset where crim is back in numeric format, and everything has been scaled:

Boston.adulterated<-Boston #Back up the version in which crime is a factor.
Boston<-as.data.frame(scale(Boston.unscaled)) #Scale the original version of Boston and make sure the result is a data frame:
head(Boston)
##         crim         zn      indus       chas        nox        rm
## 1 -0.4193669  0.2845483 -1.2866362 -0.2723291 -0.1440749 0.4132629
## 2 -0.4169267 -0.4872402 -0.5927944 -0.2723291 -0.7395304 0.1940824
## 3 -0.4169290 -0.4872402 -0.5927944 -0.2723291 -0.7395304 1.2814456
## 4 -0.4163384 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.0152978
## 5 -0.4120741 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.2273620
## 6 -0.4166314 -0.4872402 -1.3055857 -0.2723291 -0.8344581 0.2068916
##          age      dis        rad        tax    ptratio     black
## 1 -0.1198948 0.140075 -0.9818712 -0.6659492 -1.4575580 0.4406159
## 2  0.3668034 0.556609 -0.8670245 -0.9863534 -0.3027945 0.4406159
## 3 -0.2655490 0.556609 -0.8670245 -0.9863534 -0.3027945 0.3960351
## 4 -0.8090878 1.076671 -0.7521778 -1.1050216  0.1129203 0.4157514
## 5 -0.5106743 1.076671 -0.7521778 -1.1050216  0.1129203 0.4406159
## 6 -0.3508100 1.076671 -0.7521778 -1.1050216  0.1129203 0.4101651
##        lstat       medv
## 1 -1.0744990  0.1595278
## 2 -0.4919525 -0.1014239
## 3 -1.2075324  1.3229375
## 4 -1.3601708  1.1815886
## 5 -1.0254866  1.4860323
## 6 -1.0422909  0.6705582

Then we calculate the Euclidean distance between each pair of towns:

dist_eu<-dist(Boston)
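dist() defaults to the Euclidean metric, i.e., the square root of the summed squared differences across all (here, scaled) variables. A two-point sanity check:

```r
m <- rbind(a = c(0, 0), b = c(3, 4))
dist(m)   # 5, since sqrt(3^2 + 4^2) = 5
```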

Finally, we apply the clever algorithm shown in the Datacamp exercise in order to visually identify the optimal (most succinct and informative) number of clusters (groups) into which the dataset divides:

k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(dist_eu, k)$tot.withinss})
plot(1:k_max, twcss, type='b')

Here we face the same dilemma as in stepwise regression: we can see that increasing the number of clusters decreases residual variance, which is the goal. But more clusters means more categorizations, i.e., more complexity, more explaining to do, and more cognitive load. Therefore, we will abide by the rule of thumb – the optimal number of clusters is the one that brings about the most radical change in residual variance – and declare that 2 is the winning number. If we were seriously pursuing this line of investigation, we would now examine the clusters, identify their shared characteristics, and perhaps give them informative names or labels. But those are not the instructions. The instructions are to visualize the clusters. Hence now we visualize the clusters:

km <-kmeans(dist_eu, centers = 2)
pairs(Boston, col = km$cluster)

I don’t get much out of this plot, either. There’s too much stuff jammed into a tight space.
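Before moving on: the “most radical change” rule of thumb applied above can also be checked programmatically rather than by eye. A minimal sketch, recomputing the WCSS curve from scratch (directly on the scaled data rather than on dist_eu, and with a fixed seed, since kmeans uses random starting centers):

```r
library(MASS)                        # for the Boston data
set.seed(123)                        # kmeans starts are random
boston_scaled <- scale(MASS::Boston)
twcss <- sapply(1:10, function(k) kmeans(boston_scaled, k)$tot.withinss)
drops <- -diff(twcss)                # improvement gained by each extra cluster
which.max(drops) + 1                 # k that produces the single largest drop
```

On this data the largest drop should come from going from 1 cluster to 2, matching the eyeball verdict above.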

Analyzing 4 Clusters

We’ll now perform this analysis in reverse. We’ll take 4 clusters – this seems a relatively good number, judging from the second-to-last plot – and perform LDA on Boston with the 4 clusters as target classes, in order to try and identify their common characteristics.

km4<-kmeans(dist_eu, centers = 4)
Boston$cluster<-km4$cluster
(ld4.4c<-lda(cluster ~ ., data=Boston))
## Call:
## lda(cluster ~ ., data = Boston)
## 
## Prior probabilities of groups:
##         1         2         3         4 
## 0.2707510 0.2173913 0.1897233 0.3221344 
## 
## Group means:
##         crim          zn      indus        chas        nox          rm
## 1  0.9759730 -0.48724019  1.0836745  0.07252643  1.2746122 -0.51504943
## 2 -0.2781708 -0.48236779  0.4449338 -0.27232907  0.1112885 -0.34491541
## 3 -0.3904431  1.29327757 -0.8829277  0.67093453 -0.7094969  1.05013774
## 4 -0.4026194 -0.02663978 -0.6910740 -0.27232907 -0.7285393  0.04717328
##          age        dis        rad         tax    ptratio      black
## 1  0.8854486 -0.9186628  1.3368585  1.35643451  0.5296467 -0.8273642
## 2  0.4730244 -0.4156930 -0.2959232 -0.05777454  0.4757260  0.1720595
## 3 -0.6459300  0.7595211 -0.6074231 -0.64709824 -0.9518678  0.3430417
## 4 -0.6830053  0.6053314 -0.5661684 -0.71996870 -0.2055960  0.3772414
##        lstat       medv
## 1  0.9917971 -0.7561841
## 2  0.2721432 -0.3240236
## 3 -0.8065795  1.1616548
## 4 -0.5422106  0.1700672
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## crim    -0.08347580  0.25229152  0.23494161
## zn      -0.50086410  1.07427556 -0.73572327
## indus   -0.19066361 -0.38532493 -0.51849315
## chas     0.05133571  0.66311801 -0.02000862
## nox     -1.48554172  0.53651016  0.37267367
## rm       0.20818955  0.40857906 -0.18625561
## age     -0.15978026  0.02494374 -0.66480469
## dis      0.22369038  0.20325447  0.52351112
## rad     -0.22891423 -0.12421387  1.52917037
## tax     -1.04859002  0.58543578 -0.69932224
## ptratio -0.45539412 -0.13690964 -0.60978717
## black    0.28206885 -0.13889043 -0.08150766
## lstat   -0.43261121  0.62600094 -0.27284565
## medv    -0.11825412  0.74838356 -0.33152714
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.8304 0.1374 0.0321

Judging from the coefficients of the Linear Discriminants (LDs), it appears that air pollution is by far the most significant discriminating variable between the clusters, followed by property tax rate. However, it is not entirely clear to me how these coefficients are to be interpreted and applied. There are always K-1 of them, where K is the number of classes/clusters. Is it the case that LD1 differentiates the first cluster from the second, while LD2 differentiates the second cluster from the third? Hmmm… if the treatment of factors in regression analysis is any indication, then it seems likelier that one of the classes/clusters is used as the reference cluster, and the Linear Discriminants differentiate the other clusters from this reference cluster. This would explain why Proportion of Trace (= explanatory power) is highest for LD1 and progressively much smaller for the other LDs – we already saw in the previous section that 2 is the optimal number of clusters, so the discriminant differentiating these two main clusters is by far the most significant one. Our current model has 4 clusters, and the drastically lower Proportion of Trace of the later Linear Discriminants reflects the fact that adding new clusters after the first two yields diminishing returns. In other words, any additional classifications of the data made after the initial 2-way classification are much less helpful than that original classification. They do increase the overall predictive power, but it is debatable whether this improvement is enough to justify the resulting increase in model complexity.

However, what exactly do the coefficients represent? How are they calculated? In linear and logistic regression, we can manually predict the value of the response variable using the coefficients. Can we similarly use the linear discriminant coefficients to predict which cluster a given data point belongs to? If so, how exactly is this done? I have yet to find an answer, either by Googling or in any of the course materials.
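For what it’s worth, the behavior of MASS’s predict.lda() suggests one answer: the LD scores of a data point appear to be obtained by centering its predictor values on the prior-weighted average of the group means and multiplying by the coefficient matrix ($scaling); classification then happens in that LD space. A sketch verifying this by hand on a built-in dataset (iris stands in here for our Boston clusters):

```r
library(MASS)
fit <- lda(Species ~ ., data = iris)
X <- as.matrix(iris[, 1:4])
# center on the prior-weighted average of the group means,
# as predict.lda appears to do internally...
centroid <- colSums(fit$prior * fit$means)
# ...then multiply by the discriminant coefficients to get LD scores
scores <- scale(X, center = centroid, scale = FALSE) %*% fit$scaling
# the hand-computed scores should match the ones predict() returns
all.equal(unname(scores), unname(predict(fit)$x))
```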

At any rate, we can still use the Group Means data to improvise some hypotheses on the four clusters that have been identified:

  • Cluster 1: Lots of crime, lots of industry, lots of pollution, apartments are old and small, close proximity to central traffic routes and employment centers, and very few blacks in the population. Hmmm… this sounds like an old-fashioned white working-class neighborhood – exactly the kind of place that is run by archetypal gangsters in trench coats and fedoras. Think Al Capone, only in Boston instead of Chicago. In fact, if you’ve seen this film, then you know exactly the kind of place I’m envisioning.

  • Cluster 2: The main difference between this one and Cluster 1 is the dramatic reduction in crime and the fact that blacks are allowed in this neighborhood. Otherwise the two clusters seem to fit a similar profile. The dis and rad values indicate that this is a type of suburb that tends to be a bit farther from downtown but is overall still quite conveniently located. Houses are still old and small, though not as old as in Cluster 1. Therefore, I conjecture that this cluster represents the 2nd layer, or ring, of the city that was built around the core once the core became too cramped to accommodate the growing population. These places are still relatively blue-collar but they are no longer run by gangsters like Cluster 1 is. The absence of racist gangsters enables the entry of non-white ethnic groups into the neighborhood.

  • Cluster 3: These are suburbs proper – sub-urban in the most authentic sense of the word. They represent the Outer Ring of a city – relatively quiet and peaceful places with significantly less economic and industrial activity than in the previous two clusters. There is more space, so houses are bigger and tend to be newer. People move here for the increased peace and safety relative to downtown – typically when they have their first children. The flipside of the increased peace and safety is reduced access to many services and entertainment options – you have to drive or use public transportation to access some of the shiniest and trendiest goodies. But couples with small children generally don’t care much for that kind of thing.

  • Cluster 4: These are also outer suburbs, but they differ from Cluster 3 in one important respect: there are no families. These are districts constructed to provide affordable housing to newcomers who are young, single, and seeking career opportunities. Apartments are new and small, so they cost more than in Clusters 1 and 2 but significantly less than in Cluster 3.

Now it’s time to draw one of those goofy arrow plots again – biplot is what I think it’s called. Sigh. Do I really have to?

luokat <- Boston$cluster
plot(ld4.4c, dimen = 2, col = luokat, pch = luokat)
lda.arrows(ld4.4c, myscale = 1.5)

Try as I might, I just can’t extract a whole lot of useful information from this plot. Maybe I could if there were fewer variables and the picture didn’t get so cramped. But the instructions were what they were. Pretty much the only bits of data that I’m able to glean from this plot are these two:

1. Air pollution correlates with Cluster 1. Well, that’s to be expected of the core regions of the city, where the bulk of cars and businesses are.

2. Property tax rates also correlate with Cluster 1. This one is harder to decipher. I imagine it might be because being close to downtown is valued highly, especially by businesses, so those properties are also taxed more heavily regardless of their condition.

As for which variables are the most influential linear separators of the clusters, I can’t answer that confidently, since I’m still unclear on how to interpret the Linear Discriminant coefficients. But if the Group Means are any indicator (and they darn well should be), then I’d say the biggest differentiators are tax, indus, rad and nox. And I would also say that these variables likely reflect the Downtown-Suburban dichotomy. This would also explain why the WCSS by Clusters plot earlier showed 2 to be the ideal number of clusters: it was probably capturing this same dichotomy, and our present 4-cluster model just adds a bit of extra nuance to it – at the cost of increased model complexity.
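One hedged way to probe the “most influential separator” question, given that the data were standardized before clustering, is to rank the variables by the absolute size of their LD1 coefficients. A self-contained stand-in using iris (in the course flow one would sort abs(ld4.4c$scaling[, "LD1"]) instead):

```r
library(MASS)
# standardize the predictors so coefficient magnitudes are comparable
iris_scaled <- as.data.frame(scale(iris[, 1:4]))
iris_scaled$Species <- iris$Species
fit <- lda(Species ~ ., data = iris_scaled)
# largest absolute LD1 coefficient = strongest contributor to the
# dominant discriminant direction
sort(abs(fit$scaling[, "LD1"]), decreasing = TRUE)
```

This is only a rough heuristic, but it at least turns the coefficient table into a ranking.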

Visualizing Linear Discriminants

Another reason why it is inconvenient to have more clusters than strictly necessary is that spatial-geometrical visualization of the Linear Discriminants differentiating the clusters becomes impossible if the number of clusters exceeds 4. Computer screens are 2d, so visualizing the third dimension necessary to represent the third discriminant used in 4-cluster models is tricky and computationally expensive. And more than three discriminants cannot be geometrically represented at all – at least not in static images. To complete this chapter, we will now attempt to visually plot the three Linear Discriminants defined by the LD model that we built to find differentiating characteristics between the 4 categories of crime rate. This 3d plot is a product of the plotly package:

model_predictors<-TrainingSet[,which(colnames(TrainingSet)!="crime.factor")]
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=TrainingSet$crime.factor)

As you can see, the colors of the balls represent the different crime rates. The High Crime group is quite separate from the other three groups, with little overlap, while the three other groups overlap significantly among themselves. Our next step is to draw a similar plot, this time coloring the balls according to the Cluster that they were assigned to in the previous exercise:

TrainingSet$four.clusters<-as.factor(km4$cluster[TrainingRows])
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=TrainingSet$four.clusters)

We can see here that there is significant correspondence between the unnamed clusters into which we divided the dataset and the four different categories of crime rate. The cluster that we hypothetically branded the Working-Class Gangster Neighborhood is also host to the majority of the High Crime districts. The equivalence is largely lost outside the High Crime group though – the motley crew formed by the remaining data points contains units from all four clusters, and their alignment is not very systematic. We can therefore make the tentative suggestion that high crime rate is a significant factor in the categorization of suburbs into distinct classes. But we’re talking high crime rate specifically. When crime rate drops below a certain threshold, it seems to lose its differentiating power.
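The visual impression can also be checked with a plain cross-tabulation of cluster membership against crime category. A sketch, recomputed from Boston so it runs standalone; in the course flow one would tabulate TrainingSet$four.clusters against TrainingSet$crime.factor instead, and the quantile-based crime factor below is my own stand-in for the course’s crime categories:

```r
library(MASS)
set.seed(123)
boston_scaled <- as.data.frame(scale(MASS::Boston))
# quantile-based crime categories, mimicking the course's crime factor
crime_class <- cut(boston_scaled$crim,
                   breaks = quantile(boston_scaled$crim),
                   include.lowest = TRUE,
                   labels = c("low", "med_low", "med_high", "high"))
clusters <- kmeans(boston_scaled, centers = 4)$cluster
# rows: crime category, columns: cluster; a single dominant column on the
# "high" row would confirm what the 3d plot suggests
table(crime_class, clusters)
```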


Dimensionality Reduction

In this chapter we deal with data that represents countries, not people or towns. This data was collected to analyze correlations between what the UN calls ‘human development’ and a variety of other factors. Should be relatively interesting! We start by importing the dataset.

human <- read.table("C:/Users/juho/Desktop/IODS-project/IODS-project/data/human.txt", header=T, sep="\t",quote=NULL)

Then we’ll visualize it:

library(ggplot2)
library(GGally)
ggpairs(human, mapping = aes(alpha=0.3), lower = list(combo = wrap("facethist", bins = 20))) + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

Many of the variables are highly correlated with each other, as is to be expected. Some of these correlations are self-evident, such as the one between life expectancy and maternal mortality. Adolescent birth rates seem negatively correlated with life expectancy. In this case I suspect it is probably life expectancy that drives adolescent birth rates and not vice versa – a high risk of dying young lends urgency to the endeavor of continuing the species, while a long life expectancy probably makes people want to take their time choosing the right partner and circumstances for breeding. Gross National Income and life expectancy both correlate with expected length of education. In these pairs it is hard to say which variable is causing which.

Principal Component Analysis (PCA)

PCA is used to summarize, or condense, the overall variation of a set of variables into a minimal number of “principal components”, i.e., bundles of mutually correlated variables. Each bundle must be orthogonal to (uncorrelated with) the other bundles, while all variables within one bundle must correlate with each other, positively or negatively. Thus PCA is similar to LDA. The difference is that where LDA aims to summarize differences between user-specified groups, PCA aims to summarize the overall variance. Let’s now perform PCA on the human dataset:

(pca<-prcomp(human))
## Standard deviations:
## [1] 1.854416e+04 1.855219e+02 2.518701e+01 1.145441e+01 3.766241e+00
## [6] 1.565912e+00 1.912052e-01 1.591112e-01
## 
## Rotation:
##                     PC1           PC2           PC3           PC4
## Edu2.FM    5.607472e-06 -0.0006713951 -3.412027e-05 -2.736326e-04
## Labo.FM   -2.331945e-07  0.0002819357  5.302884e-04 -4.692578e-03
## Edu.Exp    9.562910e-05 -0.0075529759  1.427664e-02 -3.313505e-02
## Life.Exp   2.815823e-04 -0.0283150248  1.294971e-02 -6.752684e-02
## GNI        9.999832e-01  0.0057723054 -5.156742e-04  4.932889e-05
## Mat.Mor   -5.655734e-03  0.9916320120  1.260302e-01 -6.100534e-03
## Ado.Birth -1.233961e-03  0.1255502723 -9.918113e-01  5.301595e-03
## Parli.F    5.526460e-05 -0.0032317269 -7.398331e-03 -9.971232e-01
##                     PC5           PC6           PC7           PC8
## Edu2.FM    0.0022935252  2.180183e-02 -6.998623e-01  7.139410e-01
## Labo.FM   -0.0022190154  3.264423e-02 -7.132267e-01 -7.001533e-01
## Edu.Exp   -0.1431180282  9.882477e-01  3.826887e-02  7.776451e-03
## Life.Exp  -0.9865644425 -1.453515e-01 -5.380452e-03  2.281723e-03
## GNI        0.0001135863 -2.711698e-05  8.075191e-07 -1.176762e-06
## Mat.Mor   -0.0266373214  1.695203e-03 -1.355518e-04  8.371934e-04
## Ado.Birth -0.0188618600  1.273198e-02  8.641234e-05 -1.707885e-04
## Parli.F    0.0716401914 -2.309896e-02  2.642548e-03  2.680113e-03
(pca.summary<-summary(pca))
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000
##                           PC8
## Standard deviation     0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion  1.0000

I can’t make much sense of these results. They claim that only the first principal component (PC1) is meaningful. But this is probably because the analysis was done without scaling the dataset first. PCA assumes that variables with a high variance are more meaningful than ones with a small variance. The variables in human are measured in different units with different ranges – most are percentages that vary between 0 and 1, while others use units like dollars or euros per capita and range between zero and infinity. It is to be expected that this will confuse PCA. But let’s nonetheless comply with the instructions and draw a biplot using the first two principal components as axes:

prossat<-round(100*pca.summary$importance[2, ], digits = 1)
laabelitjaleibelit<-paste0(names(prossat), " (", prossat, "%)")
biplot(pca, cex = c(0.8, 1), col = c("blue", "red"), xlab = laabelitjaleibelit[1], ylab = laabelitjaleibelit[2], main="Human development (unscaled) by two principal components")
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

This also seems to support the notion that performing PCA on variables with different units and ranges doesn’t make much sense. You can even see R complaining about the virtual non-existence of the principal components other than the first. Their effects are virtually nullified by the disproportionately large range of the variable dominating PC1.
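A two-variable toy example makes the scale effect explicit: when one of two uncorrelated variables is measured in the thousands and the other is a proportion, unscaled PCA attributes essentially all variance to the big one. (The variable names here are made up for illustration.)

```r
set.seed(1)
toy <- data.frame(income = rnorm(100, mean = 20000, sd = 5000),  # "dollars"
                  share  = runif(100))                           # 0..1
# unscaled: PC1 is effectively just 'income'
summary(prcomp(toy))$importance["Proportion of Variance", ]
# scaled: the two uncorrelated variables split the variance roughly 50/50
summary(prcomp(toy, scale. = TRUE))$importance["Proportion of Variance", ]
```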

Now let’s perform the same analysis and plotting on a scaled version of the dataset, and admire the results:

human.scaled<-scale(human)
pca.scaled<-prcomp(human.scaled)
(pca.scaled.summary<-summary(pca.scaled))
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
##                            PC7     PC8
## Standard deviation     0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion  0.98702 1.00000
rosentit<-round(100*pca.scaled.summary$importance[2, ], digits = 1)
akselit<-paste0(names(rosentit), " (", rosentit, "%)")
biplot(pca.scaled, cex = c(0.8, 1), col = c("darkgreen", "purple"), xlab = akselit[1], ylab = akselit[2], main="Human development (scaled) by two principal components")

There. Though the high density of data points makes the plot messy, now we can actually get at least SOME useful information out of it. Principal Component 1, i.e. the most important combined element of variance in the dataset, is strongly positively correlated with life expectancy and expected years of education. It is strongly negatively correlated with the rates of teen pregnancy and maternal mortality. Principal Component 2 is strongly correlated with the level of female participation in politics and the workforce. That is why one end of this spectrum seems to consist almost exclusively of Arab countries, where women’s participation in public forums is heavily restricted.

Multiple Correspondence Analysis (MCA)

Now we shall attempt something similar to PCA, but with categorical variables. Our dataset is about people’s tea consumption habits. We begin by loading the necessary libraries – then we’ll import the data and survey it:

library(FactoMineR)
data(tea)
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
summary(tea)
##          breakfast           tea.time          evening          lunch    
##  breakfast    :144   Not.tea time:131   evening    :103   lunch    : 44  
##  Not.breakfast:156   tea time    :169   Not.evening:197   Not.lunch:256  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##         dinner           always          home           work    
##  dinner    : 21   always    :103   home    :291   Not.work:213  
##  Not.dinner:279   Not.always:197   Not.home:  9   work    : 87  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##         tearoom           friends          resto          pub     
##  Not.tearoom:242   friends    :196   Not.resto:221   Not.pub:237  
##  tearoom    : 58   Not.friends:104   resto    : 79   pub    : 63  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##         Tea         How           sugar                     how     
##  black    : 74   alone:195   No.sugar:155   tea bag           :170  
##  Earl Grey:193   lemon: 33   sugar   :145   tea bag+unpackaged: 94  
##  green    : 33   milk : 63                  unpackaged        : 36  
##                  other:  9                                          
##                                                                     
##                                                                     
##                                                                     
##                   where                 price          age        sex    
##  chain store         :192   p_branded      : 95   Min.   :15.00   F:178  
##  chain store+tea shop: 78   p_cheap        :  7   1st Qu.:23.00   M:122  
##  tea shop            : 30   p_private label: 21   Median :32.00          
##                             p_unknown      : 12   Mean   :37.05          
##                             p_upscale      : 53   3rd Qu.:48.00          
##                             p_variable     :112   Max.   :90.00          
##                                                                          
##            SPC               Sport       age_Q          frequency  
##  employee    :59   Not.sportsman:121   15-24:92   1/day      : 95  
##  middle      :40   sportsman    :179   25-34:69   1 to 2/week: 44  
##  non-worker  :64                       35-44:40   +2/day     :127  
##  other worker:20                       45-59:61   3 to 6/week: 34  
##  senior      :35                       +60  :38                    
##  student     :70                                                   
##  workman     :12                                                   
##              escape.exoticism           spirituality        healthy   
##  escape-exoticism    :142     Not.spirituality:206   healthy    :210  
##  Not.escape-exoticism:158     spirituality    : 94   Not.healthy: 90  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##          diuretic             friendliness            iron.absorption
##  diuretic    :174   friendliness    :242   iron absorption    : 31   
##  Not.diuretic:126   Not.friendliness: 58   Not.iron absorption:269   
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##          feminine             sophisticated        slimming  
##  feminine    :129   Not.sophisticated: 85   No.slimming:255  
##  Not.feminine:171   sophisticated    :215   slimming   : 45  
##                                                              
##                                                              
##                                                              
##                                                              
##                                                              
##         exciting          relaxing              effect.on.health
##  exciting   :116   No.relaxing:113   effect on health   : 66    
##  No.exciting:184   relaxing   :187   No.effect on health:234    
##                                                                 
##                                                                 
##                                                                 
##                                                                 
## 

This is a tad too many variables to visualize in one plot without cluttering it, so we’ll select only 6 of them. Before removing variables, however, we’ll back up the original dataset.

tea.backup<-tea
tea<-tea[,c("Tea","How","how","where","sugar","lunch")]
head(tea)
##         Tea   How     how       where    sugar     lunch
## 1     black alone tea bag chain store    sugar Not.lunch
## 2     black  milk tea bag chain store No.sugar Not.lunch
## 3 Earl Grey alone tea bag chain store No.sugar Not.lunch
## 4 Earl Grey alone tea bag chain store    sugar Not.lunch
## 5 Earl Grey alone tea bag chain store No.sugar Not.lunch
## 6 Earl Grey alone tea bag chain store No.sugar Not.lunch

Now we’ll plot this thing using the code from Datacamp:

library(tidyr)
gather(tea) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables; they will
## be dropped
Bars illustrating distributions of factor levels

These plots plainly show us that the most popular form of tea is a regular tea bag, bought in a regular chain store and drunk plain (“alone”, i.e. without milk or lemon). This is clearly a British dataset, since Earl Grey is the preferred variety. Most people don’t have tea with lunch, and sugar use is split almost evenly between users and non-users. Next, we will apply MCA to this data and draw a biplot:

tea.mca<-MCA(tea,graph=F) 
(mca.summary<-summary(tea.mca))
## 
## Call:
## MCA(X = tea, graph = F) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.279   0.261   0.219   0.189   0.177   0.156
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.144   0.141   0.117   0.087   0.062
## % of var.              7.841   7.705   6.392   4.724   3.385
## Cumulative % of var.  77.794  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898
##                       cos2  v.test     Dim.3     ctr    cos2  v.test  
## black                0.003   0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            0.027   2.867 |   0.433   9.160   0.338  10.053 |
## green                0.107  -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone                0.127  -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                0.035   3.226 |   1.329  14.771   0.218   8.081 |
## milk                 0.020   2.422 |   0.013   0.003   0.000   0.116 |
## other                0.102   5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag              0.161  -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged   0.478  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged           0.141  -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## where              | 0.702 0.681 0.055 |
## sugar              | 0.065 0.001 0.336 |
## lunch              | 0.000 0.064 0.111 |
plot(tea.mca, invisible = c("ind"), habillage = "quali")
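As a side note, the default FactoMineR plot gets crowded with this many categories. A sketch of an alternative, assuming the factoextra package is installed (it is not used elsewhere in this diary):

```r
library(factoextra)

# Category map with labels repelled so they don't overlap
fviz_mca_var(tea.mca, repel = TRUE)

# Individuals and categories on the same biplot
fviz_mca_biplot(tea.mca, repel = TRUE)
```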

The Dims apparently correspond to the principal components of PCA, and the % of var and Cumulative % of var rows seem to correspond to the Proportion of Variance and Cumulative Proportion figures in PCA. These dimensions do a relatively poor job of explaining the variance of the dataset: the first two together account for only about 30% of the total variance, whereas the first two PCs in our previous exercise accounted for 70%.
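The eigenvalue table printed above can also be inspected programmatically. A quick sketch, using the `tea.mca` object fitted earlier:

```r
# Eigenvalues, % of variance and cumulative % as a matrix
tea.mca$eig

# Cumulative variance of the first two dimensions (about 30% here)
sum(tea.mca$eig[1:2, "percentage of variance"])

# A scree plot to eyeball how many dimensions are worth keeping
barplot(tea.mca$eig[, "percentage of variance"],
        names.arg = paste0("Dim", seq_len(nrow(tea.mca$eig))),
        ylab = "% of variance")
```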

The factor levels unpackaged and tea bag both play a huge role in the first principal component, as shown by their massively significant v.test values (which are supposedly equivalent to z-scores). Both of them are levels of the same variable, namely how, and they thus measure the same thing. I guess this is because, generally, whether you have your tea in a teabag or unpackaged is an indicator of whether you’re having it at home or in a café, and those things have implications for a number of other variables too. This also explains why, on the biplot, unpackaged and tea shop are the two items most different from the rest yet similar to each other: both are factor levels closely tied to the action of having tea outside your home. The same thing explains why, in the Categorical variables table, how and where are by far the two most important contributors to this principal component. They are flipsides of the same coin.
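This reading can be checked with FactoMineR’s dimdesc(), which ranks the variables and categories most strongly associated with each dimension. A sketch:

```r
# Describe dimensions 1 and 2: variables ranked by eta2,
# categories by estimated coordinate, with association p-values
dims <- dimdesc(tea.mca, axes = 1:2)

dims$`Dim 1`$quali     # variables most linked to the first dimension
dims$`Dim 1`$category  # category levels with significant coordinates
```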

Tea type is the second-most important element in the first principal component, which is somewhat surprising. Within this variable, Earl Grey is the most influential factor level. This might suggest that people who drink this variety differ from the average population. An easy, if somewhat intellectually lazy, assumption is that the Earl Grey drinkers are stereotypical upper-class Englishmen with impeccable Oxford accents, whose culture and lifestyle are very different from the common man’s.